LuyiTian / FLAMES

Full-length transcriptome splicing and mutation analysis
GNU General Public License v3.0

match_cell_barcode qnames too long #16

Open danledinh opened 2 years ago

danledinh commented 2 years ago

Minimap2/Samtools throws an error on reads with appended cell barcodes/UMIs (generated by match_cell_barcode):

[E::sam_parse1] query name too long
[W::sam_read1_sam] Parse error at line 8760987
samtools sort: truncated file. Aborting

Here is an example qname:

@CTACGGGAGAGCTTTC_CGATAAGACCCA#ACATCGAGTCAAACGG_GCACATCTTGGC#GTAGAGGAGCGGGTTA_AGGCACCTATGT#AGTACTGAGAGTCAGC_CTCAGCCAGTAA#TGTCCCAGTTACCGTA_ATCGTACCAGTC#AATCGTGTCGACATCA_ACTCAAGGCCAT#CGAGAAGGTTCGGCGT_TACGCCAGTCTG#GCTGCAGCACATGGTT_TGATTATGCCTC#CCGTAGGCAGACTGCC_CTCTCGCATACA#TAAGTCGCAGGAGGTT_TAACTATTTACG#TCGTAGATCACTACGA_AGACGCAAATTT#GTCGAATAGGTTACAA_ACAAATTGTTTC#ACAAGCTCAGGCGTTC_CGTTGCCTATAT#GTGCACGAGGATAATC_CAGGAGTCAGAA#AGGATAAAGGTATCTC_CCAATCGCTTTA#GTCATGAGTCCTCCTA_AGCTCAAACACT#GACTTCCCAAAGTATG_GCCCACTTGCTG#TGTACAGTCAACCGAT_TGAAGCATCCAC#TGAGGTTTCAAGGACG_GGACCAAGTCGG#TTACGCCCAGCCATTA_AATCACCGCTCG#ATATCCTCACAATGAA_AATTATCTCTTT#CCACACTCAATAGGGC_CACCTATTTTTT#TCTCTGGCAAACACGG_GCCCCTGCATAG#ATATCCTGTATTCCGA_AATTATGAACTT#TCCCATGGTTGCGGAA_AAATTACAATCC#AGTAGTCTCGTCTCAC_CCATGATTCACG#CTAACCCGTGGCCTCA_ATTTACAGATGA#32fd44aa-9033-40d6-a233-bf43ece68751

It looks like the qname must be 254 characters or shorter: https://github.com/samtools/samtools/issues/1081
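To find every offending read before alignment, a quick scan of the FASTQ headers can help. This is only a sketch, assuming standard 4-line FASTQ records and taking the qname as the header text after `@` up to the first whitespace; the function name `long_qnames` and the demo data are made up for illustration and are not part of FLAMES.

```python
import io

MAX_QNAME_LEN = 254  # QNAME length limit in the SAM format


def long_qnames(fastq_handle, max_len=MAX_QNAME_LEN):
    """Yield (record_index, qname_length) for records whose name exceeds max_len.

    Assumes 4 lines per FASTQ record (no wrapped sequence lines).
    """
    for line_no, line in enumerate(fastq_handle):
        if line_no % 4 != 0:
            continue  # only header lines carry the read name
        # Drop the leading '@' and keep the text up to the first whitespace.
        fields = line[1:].split()
        qname = fields[0] if fields else ""
        if len(qname) > max_len:
            yield line_no // 4, len(qname)


# Synthetic two-record FASTQ: one short name, one 300-character name.
demo = io.StringIO(
    "@short_read\nACGT\n+\nIIII\n"
    "@" + "A" * 300 + "\nACGT\n+\nIIII\n"
)
offenders = list(long_qnames(demo))
print(offenders)
```

On a real file, pass an open handle (for example from `open(path)` or `gzip.open(path, "rt")`) instead of the `io.StringIO` demo.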

LuyiTian commented 2 years ago

Oh, that is weird. The format for FASTQ read names should be @CellBarcode_UMI#OriginalName, so there should be only one cell barcode and UMI tag in the read name. Can you give me the original FASTQ read so I can identify what is wrong? Is this the only read that has a long name, or is it the same for all reads?
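For reference, here is a rough sketch of how a read name in that format could be split into its tags and the original name. It assumes barcodes and UMIs contain only the letters A, C, G, T, N and that the original read name does not itself begin with such a `BC_UMI#` pattern; `split_read_name` is a hypothetical helper, not a FLAMES function.

```python
import re

# One "CellBarcode_UMI#" prefix; assumes nucleotide-only barcode and UMI.
TAG = re.compile(r"([ACGTN]+)_([ACGTN]+)#")


def split_read_name(name):
    """Return ([(barcode, umi), ...], original_name) for a FASTQ read name."""
    if name.startswith("@"):
        name = name[1:]
    tags = []
    pos = 0
    while True:
        match = TAG.match(name, pos)  # anchored at pos, so only leading tags count
        if match is None:
            break
        tags.append((match.group(1), match.group(2)))
        pos = match.end()
    return tags, name[pos:]


# First two tags of the qname reported above, plus its original read ID.
example = (
    "@CTACGGGAGAGCTTTC_CGATAAGACCCA#ACATCGAGTCAAACGG_GCACATCTTGGC"
    "#32fd44aa-9033-40d6-a233-bf43ece68751"
)
tags, original = split_read_name(example)
print(len(tags), original)
```

A correctly formatted name should yield exactly one tag; anything more points to the barcode/UMI prefix having been added repeatedly.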

danledinh commented 2 years ago

Unfortunately, I can't release the FASTQ per company policy. Here are some more observations that might help you troubleshoot the problem:

1) I used an edit distance of 2.
2) About 10-15% of BC-matched reads have more than one BC/UMI appended.
3) The reads with multiple BC/UMI tags are mostly at the end of the FASTQ file (is there a sorting step?).

I hope that helps!
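Observations like 2) and 3) can be reproduced without sharing any reads, by tallying how many BC/UMI tags each read name carries and where the multi-tag reads sit in the file. The sketch below assumes 4-line FASTQ records and nucleotide-only tags of the form `BC_UMI#`; `count_tags`, `summarize`, and the toy records are illustrative only.

```python
import io
import re
from collections import Counter

# Zero or more leading "BC_UMI#" prefixes after an optional '@'.
LEADING_TAGS = re.compile(r"@?((?:[ACGTN]+_[ACGTN]+#)*)")


def count_tags(name):
    """Count leading BC_UMI# prefixes in a read name."""
    return LEADING_TAGS.match(name).group(1).count("#")


def summarize(fastq_handle):
    """Return (Counter of tags-per-read, record indices with more than one tag)."""
    counts = Counter()
    multi_positions = []
    for line_no, line in enumerate(fastq_handle):
        if line_no % 4 != 0:
            continue  # header lines only
        n = count_tags(line.strip())
        counts[n] += 1
        if n > 1:
            multi_positions.append(line_no // 4)
    return counts, multi_positions


# Toy data: two well-formed names and one with a doubled tag.
names = ["@AAAC_GGT#read1", "@CCCA_TTG#read2", "@AAAC_GGT#CCCA_TTG#read3"]
demo = io.StringIO("".join(f"{n}\nACGT\n+\nIIII\n" for n in names))
counts, multi_positions = summarize(demo)
print(dict(counts), multi_positions)
```

The record indices show whether multi-tag reads cluster at the end of the file, and the counter gives the fraction of affected reads.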