caleblareau / mgatk

mgatk: mitochondrial genome analysis toolkit
http://caleblareau.github.io/mgatk
MIT License

Possible improvement on chunk_barcoded_bam.py #73

Closed ruochiz closed 1 year ago

ruochiz commented 1 year ago

Thank you for creating this useful toolkit. When running the software on a really large combined library (~200k cells to consider), I found that chunk_barcoded_bam.py becomes the bottleneck, and I found two possible ways to speed it up.

  1. Transform the cell barcode list from a list into a set, bc = set([x.strip() for x in content]), which greatly speeds up checking whether a barcode exists (~800 records/s -> ~100k records/s)
  2. Use pysam's read.get_tag instead of iterating over every tag on the read:
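def getBarcode(read, tag_get):
  '''
  Parse out the barcode per-read
  '''
  # Previous approach: scan every tag on the read
  # for tg in read.tags:
  #     if(tag_get == tg[0]):
  #         return(tg[1])
  # return("AA")
  try:
    # Direct lookup; pysam raises KeyError if the tag is absent
    return read.get_tag(tag_get)
  except KeyError:
    return "AA"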

This improves the speed further, from ~100k records/s to 130k records/s.
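For reference, here is a minimal sketch of how the two changes fit together. It is not the actual chunk_barcoded_bam.py code: load_barcodes, get_barcode, and FakeRead are hypothetical names used only for illustration, and FakeRead is a stand-in for a pysam read so the snippet runs without a BAM file. It relies on get_tag raising KeyError when the tag is missing, which is how pysam behaves.

```python
def load_barcodes(lines):
    """Build a set of barcodes so membership tests are O(1) on average."""
    return {line.strip() for line in lines if line.strip()}


def get_barcode(read, tag_get, default="AA"):
    """Return the value of tag `tag_get`, or `default` if the read lacks it."""
    try:
        return read.get_tag(tag_get)
    except KeyError:
        return default


class FakeRead:
    """Hypothetical stand-in for a pysam read, used only for this demo."""

    def __init__(self, tags):
        self._tags = dict(tags)

    def get_tag(self, tag):
        # Raises KeyError when the tag is absent, as pysam's get_tag does.
        return self._tags[tag]


# Assumed example data: two known barcodes and three reads.
barcodes = load_barcodes(["AAACGG-1\n", "TTTGCA-1\n", "\n"])
reads = [
    FakeRead({"CB": "AAACGG-1"}),  # known barcode, kept
    FakeRead({"CB": "GGGGGG-1"}),  # unknown barcode, dropped
    FakeRead({}),                  # no CB tag, falls back to "AA", dropped
]

kept = []
for read in reads:
    barcode = get_barcode(read, "CB")
    if barcode in barcodes:  # set lookup instead of a list scan
        kept.append(barcode)
```

With a list, each `in` check compares against every barcode in turn, so the cost per read grows with the number of cells; with a set it stays roughly constant, which is where most of the speedup for ~200k cells comes from.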

caleblareau commented 1 year ago

Thanks for the input! Would you be able to open a PR for this?

(Sent by email in reply to ruochiz's comment of Jun 25, 2023, 1:16 PM.)

caleblareau commented 1 year ago

Now implemented in v0.6.8. Thank you very much @ruochiz for the contribution. You should be able to pip install the latest version of the software now.