bvaldebenitom / SoloTE

GNU General Public License v3.0
27 stars 6 forks source link

what samtools version did you guys use? #3

Closed hoangmgh closed 1 year ago

hoangmgh commented 1 year ago

Thanks for this great tools! I like the idea of single cell TEs.

However, would you be able to tell me what specific version of samtools did you used for the sake of reproducible results? While running this tool, I found out the following errors:

bvaldebenitom commented 1 year ago

Hi @hoangmgh,

thanks for the detailed feedback.

As for Samtools, we require at least version 1.11, which has the -d flag available.

Do you have any BAM file you can share that has empty GN tags?

hoangmgh commented 1 year ago

I aligned the reads using Cumulus's STAR solo pipeline, which I think is the latest version of STARsolo. So in this file the GN tags are present for all reads: NB551582:70:HF2H2BGXJ:1:21110:18706:5729 0 chr1 10048 0 58M 0 0 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACACTAACCCTAAAACTAA A666A////A/////AAEA//6EEE6AAEAA//A/<A/AEE/A/EEA/E/AA//AEAA CR:Z:ACACCCTCATACACCC UR:Z:TAACCCTTAG CY:Z:AAAAAEEEEE6EEEEA UY:Z:EEEEEEAEEE NH:i:9 HI:i:1 nM:i:3 AS:i:51 GX:Z:- GN:Z:- sS:Z:ACACCCTCATACACCCTAACCCTTAG sQ:Z:AAAAAEEEEE6EEEEAEEEEEEAEEE sM:i:-1 CB:Z:- UB:Z:- NB551582:70:HF2H2BGXJ:3:23509:13066:10745 16 chr1 10049 0 58M 0 0 TAACCATATCCTTATACCTTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC E/////EA//E/EE//////EEEEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEAAAAA CR:Z:TATGCCCCAGCGTTCG UR:Z:CCTTCCATAG CY:Z:AAAAAEEA/AAEEEEE UY:Z:EEEEEEEEE6 NH:i:23 HI:i:1 nM:i:6 AS:i:45 GX:Z:- GN:Z:- sS:Z:TATGCCCCAGCGTTCGCCTTCCATAG sQ:Z:AAAAAEEA/AAEEEEEEEEEEEEEE6 sM:i:0 CB:Z:TATGCCCCAGCGTTCG UB:Z:CCTTCCATAG NB551582:70:HF2H2BGXJ:4:21504:15822:2416 16 chr1 10049 0 58M 0 0 TAACCCTTAACCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC </A/AE////AAAE/////EEA/AEEEEEE6EEEEEEEEEEEEEEEEEEEEEEAAAAA CR:Z:TCAATCTGTGGACGAT UR:Z:AACTGCTTGC CY:Z:AAAAAEEEEEEEEEEE UY:Z:EEEEEEEEEE NH:i:29 HI:i:1 nM:i:2 AS:i:53 GX:Z:- GN:Z:- sS:Z:TCAATCTGTGGACGATAACTGCTTGC sQ:Z:AAAAAEEEEEEEEEEEEEEEEEEEEE sM:i:0 CB:Z:TCAATCTGTGGACGAT UB:Z:AACTGCTTGC NB551582:70:HF2H2BGXJ:2:23211:18700:11761 0 chr1 10052 0 56M2S 0 0 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCTAACCCTAAACCTAACCAT AAAAAEEEEEEEEEEEEEEEAEEEEAEEEE/<//A</////<////EE////6//<// CR:Z:AGCGTCGCAGCGTTCG UR:Z:TTCAGCGGCA CY:Z:AAAAAEEEEEEEEEEE UY:Z:EEEEEEEEEE NH:i:33 HI:i:1 nM:i:2 AS:i:51 GX:Z:- GN:Z:- sS:Z:AGCGTCGCAGCGTTCGTTCAGCGGCA sQ:Z:AAAAAEEEEEEEEEEEEEEEEEEEEE sM:i:0 CB:Z:AGCGTCGCAGCGTTCG UB:Z:TTCAGCGGCA NB551582:70:HF2H2BGXJ:3:23510:21768:13541 16 chr1 10052 0 1S57M * 0 0 TCCCTACCCATCCCCCTAACCCTAAACCTAACCCTAACCCTAACCCTAACCCTAACCC /EA/EE/E//E//EE/EE///6E/A6AEAEEAE6E6EEEEEEEEEEEEEEEEEAAAAA CR:Z:TATGCCCCAGCGTTCG UR:Z:CCTTCCATAG CY:Z:AAAAAEEEEEEEEEEE UY:Z:EEEEEEEEEE NH:i:14 HI:i:1 nM:i:5 AS:i:46 GX:Z:- GN:Z:- sS:Z:TATGCCCCAGCGTTCGCCTTCCATAG sQ:Z:AAAAAEEEEEEEEEEEEEEEEEEEEE sM:i:0 CB:Z:TATGCCCCAGCGTTCG UB:Z:CCTTCCATAG

I think this might have been the result of different STARsolo version.

hoangmgh commented 1 year ago

I think it is not empty value GN, but rather some of them go with a dash, i.e. not assigned to any genes

bvaldebenitom commented 1 year ago

Ok, thanks for sharing that information.

In the Cumulus STAR solo BAM file, are there any reads not having the GN tag, or are either GN:Z:- or GN:Z:GENENAME ?

hoangmgh commented 1 year ago

None of the reads have empty GN tags (which is why I think the te_file is empty)

biohiran commented 1 year ago

Hi ,

I also encountered the same issue. my "out_test_nogenes.bam" file is empty as I do have GN:-, i.e. GN tag with no gene info.

Usually GN indicates Semicolon-separated list of gene names that are compatible with this alignment. Gene names are specified with gene_name key in the reference GTF attribute column. Therefore , "GN:-" represents reads NOT mapped to any gene and we may need to result these into "out_test_nogenes.bam". Is my understanding correct ?

bvaldebenitom commented 1 year ago

@biohiran your understanding is correct, and it seems to be the default behavior of STAR now. We are currently working on a fix to consider these cases.

hoangmgh commented 1 year ago

a future feature would be quantifying TEs that overlap the gene body as well (beside the only non-gene mode). It would be interesting to see if the EM can effectively correct the reads.

bvaldebenitom commented 1 year ago

@biohiran @hoangmgh The GN tag issue is solved now with Release 1.02

hoangmgh commented 1 year ago

thanks you guys for the work!

bvaldebenitom commented 1 year ago

thanks you guys for the work!

Please let us know if you try it.

@hoangmgh Regarding your other comment, did you mean that we also include the TEs inside the gene body?

hoangmgh commented 1 year ago

That s what I mean - I personally believe intronic TEs are meaningful as well (and in fact make up more reads than the intergenic/non-gene counts)

bvaldebenitom commented 1 year ago

Actually SoloTE already considers intronic TEs. And indeed we found a couple for our paper!

However, it doesn't consider TEs that are fully within the exon body.