Converting FASTQs into an unmapped BAM: Index out of range

george-butler commented 2 years ago

Hi all,

Firstly, thank you for creating an excellent pipeline and such a versatile set of tools! I have used your reconstruct.ipynb notebook extensively and it has been great!

However, I am having some issues with the preprocess.ipynb notebook, specifically with the initial conversion from FASTQs to an unmapped .bam. The FASTQ files that I am using are from your previous Quinn et al. paper, but when I try to use the convert_fastqs_to_unmapped_bam() function I get the following error:

  File "/home/george/anaconda3/envs/lineage_tracing/lib/python3.7/site-packages/ngs_tools/chemistry/Chemistry.py", line 129, in parse
    raise IndexError('string index out of range')
IndexError: string index out of range

I know that this is probably just a stupid mistake on my part but I can't work out where I am going wrong.

If you have any suggestions it would be greatly appreciated!

Thanks,
George

mattjones315 commented 2 years ago

Hi George,

Thanks for using Cassiopeia! This sounds like you might be using the wrong 10X chemistry setting for processing the FASTQs. Can you let me know how you're invoking the function convert_fastqs_to_unmapped_bam()?

Some more details on what I suspect the error to be: 10X cDNA libraries are typically sequenced as paired-end reads, where R1 contains the cell barcode (cellBC) and UMI, and R2 contains the gene sequence (in our case, the target site). The issue at hand comes from the different R1 structures of the 10X v2 and v3 chemistries: in particular, v3 has a 12 nt UMI and v2 has a 10 nt UMI (for more information on the read structure, this is a nice tutorial).

This matters because the Quinn et al. dataset was generated with 10X v2 chemistry, which has a 10 nt UMI. If you run convert_fastqs_to_unmapped_bam with the v3 chemistry setting, you'll hit an IndexError because the parser expects a longer R1 than the one you actually have.
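To make the mismatch concrete, here is a toy sketch of how a layout-based R1 parser can fail on a read that is too short. It is not the ngs_tools implementation: the names R1_LAYOUTS and parse_r1 are made up for illustration, and the layout assumes a 16 nt cell barcode followed by the UMI (10 nt for v2, 12 nt for v3).

```python
# Toy illustration only; this is NOT the ngs_tools parser.
# Assumed R1 layout: 16 nt cell barcode, then the UMI (10 nt in v2, 12 nt in v3).
R1_LAYOUTS = {
    "10xv2": {"barcode": (0, 16), "umi": (16, 26)},
    "10xv3": {"barcode": (0, 16), "umi": (16, 28)},
}


def parse_r1(sequence, chemistry):
    """Slice the barcode and UMI out of an R1 sequence (hypothetical helper)."""
    parsed = {}
    for name, (start, end) in R1_LAYOUTS[chemistry].items():
        # Plain Python slicing would silently return a truncated string,
        # so a parser has to check the bounds itself and raise explicitly.
        if end > len(sequence):
            raise IndexError(
                f"{name} needs bases {start}-{end}, "
                f"but R1 is only {len(sequence)} nt long"
            )
        parsed[name] = sequence[start:end]
    return parsed


v2_read = "ACGT" * 6 + "AC"  # a made-up 26 nt R1, as v2 chemistry would produce

print(parse_r1(v2_read, "10xv2"))  # parses: 16 nt barcode + 10 nt UMI

try:
    parse_r1(v2_read, "10xv3")  # v3 layout asks for 28 nt
except IndexError as err:
    print("IndexError:", err)
```

With the v2 layout the 26 nt read parses cleanly; with the v3 layout the UMI would need to extend to base 28, which the read does not have.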

So, the tl;dr: make sure that you're setting chemistry='10xv2' in that function call. You can also check out our documentation website to see what other chemistries are supported.
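If you're unsure which chemistry a dataset used, one quick sanity check is to look at the R1 read lengths before running the conversion. The sketch below is only a heuristic, with some assumptions: the helper names (r1_length_counts, chemistries_that_fit) are hypothetical and not part of Cassiopeia, the FASTQ is assumed to use the usual four-lines-per-record layout, and the minimum lengths assume a 16 nt cell barcode plus the UMI. An R1 sequenced longer than 28 nt fits both layouts, so the check can rule v3 out but cannot confirm it.

```python
import gzip
from collections import Counter
from itertools import islice

# Assumed minimum R1 length per chemistry: 16 nt cell barcode + UMI.
MIN_R1_LENGTH = {"10xv2": 26, "10xv3": 28}


def r1_length_counts(fastq_path, max_reads=10000):
    """Count sequence lengths in the first `max_reads` records of a FASTQ."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        # Each record is 4 lines; the sequence is the 2nd line of the record.
        for i, line in enumerate(islice(handle, max_reads * 4)):
            if i % 4 == 1:
                counts[len(line.strip())] += 1
    return counts


def chemistries_that_fit(counts):
    """Return the chemistries whose R1 layout fits the shortest read seen."""
    if not counts:
        raise ValueError("no reads found")
    shortest = min(counts)
    return [name for name, need in MIN_R1_LENGTH.items() if shortest <= 0 or need <= shortest]
```

For example, calling chemistries_that_fit(r1_length_counts("R1.fastq.gz")) on a file whose R1 reads are all 26 nt long (the file name here is a placeholder) returns only "10xv2", which is the value to pass as chemistry in the call above.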

Hope this helps and let me know if you run into any other issues!

Best, Matt

george-butler commented 2 years ago

Hi Matt,

Thank you for the quick response! Yes, you are 100% correct: I was using the wrong chemistry, and now everything is running smoothly.

Thank you once again for developing a great pipeline and providing invaluable support.

Thanks,
George

mattjones315 commented 2 years ago

Glad to hear that worked and please don't hesitate to reach out with other questions!

YosefLab / Cassiopeia

Converting FASTQs into an unmapped BAM: Index out of range #185