bvaldebenitom / SoloTE

GNU General Public License v3.0
23 stars 6 forks source link

additional dependencies? #37

Closed coraolpe closed 3 weeks ago

coraolpe commented 7 months ago

Hi, Thank you for a cool tool! I was just curious... when I tried to run the .py script to get the repeat masker file I needed to also install the following: requests colorama tqdm

This wasn't mentioned in the dependencies section of the README ... are these super standard or am I missing something? Thanks a lot for clarifying :) best wishes Cora

bvaldebenitom commented 7 months ago

Hi @coraolpe,

I'll add them to the dependencies list since they don't seem to be standard.

Were you able to run the script after installing them?

coraolpe commented 7 months ago

Hi, yes, it ran though unfortunately with an error but I think it's an issue with my bam file, I'm working on it!

coraolpe commented 7 months ago

Hi again, so it did something but I am getting the following output: SoloTE started at 15:56:09 [OK] samtools found! [OK] bedtools found! SoloTE v1.09 started! SoloTE Home directory /scratch/colpe/transposons/scripts SoloTE executed from /scratch/colpe/transposons/scripts Results will be stored in /home/colpe/scratch/transposons/SoloTE_output Input BAM file: /home/colpe/scratch/transposons/bam_files/plate_v003_Aligned.sortedByCoord.out.bam Input TE BED file: /home/colpe/scratch/transposons/references/mm10_rmsk.bed Currently working in temporary directory: /home/colpe/scratch/transposons/SoloTE_output/plate3_SoloTESoloTE_temp samtools view -@ 6 -O BAM -o plate3_SoloTEnogenes_overlappingtes.bam -L /home/colpe/scratch/transposons/references/mm10_rmsk.bed -e '(exists([CB]) && exists([UB]) && [CB]!="-" && [UB]!="-") && (!exists([GN]) || [GN]=="-")' /home/colpe/scratch/transposons/bam_files/plate_v003_Aligned.sortedByCoord.out.bam samtools index plate3_SoloTEnogenes_overlappingtes.bam bedtools bamtobed -i plate3_SoloTEnogenes_overlappingtes.bam -split > plate3_SoloTEnogenes_overlappingtes.bed bedtools intersect -a /home/colpe/scratch/transposons/references/mm10_rmsk.bed -b plate3_SoloTEnogenes_overlappingtes.bed -u > plate3_SoloTEselectedtes.bed python /scratch/colpe/transposons/scripts/annotateBAM.py plate3_SoloTEnogenes_overlappingtes.bam plate3_SoloTEselectedtes.bed temp_annotated_te.bam 1 samtools sort -@ 6 -O BAM -o plate3_SoloTEteannotated.bam temp_annotated_te.bam samtools merge --threads 6 -o - /home/colpe/scratch/transposons/bam_files/plate_v003_Aligned.sortedByCoord.out.bam plate3_SoloTEteannotated.bam|samtools view -@ 6 -O BAM -o plate3_SoloTEfinal.bam -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-" && [GN]!="-"' --keep-tag GN,CB,UB samtools index plate3_SoloTE__final.bam Counts for chromosome 14 are being generated in process: 2403890 Counts for chromosome 15 are being generated in process: 2403890 Counts for chromosome 4 are being generated in process: 2403890 Counts for chromosome Y are being generated in process: 2403890 Counts for chromosome GL456233.1 are being generated in process: 2403890 Counts for chromosome JH584294.1 are being generated in process: 2403890 Counts for chromosome JH584304.1 are being generated in process: 2403890 Counts for chromosome GL456216.1 are being generated in process: 2403890 Counts for chromosome JH584292.1 are being generated in process: 2403890 Counts for chromosome JH584295.1 are being generated in process: 2403890 Counts for chromosome 13 are being generated in process: 2403889 Counts for chromosome 16 are being generated in process: 2403889 Counts for chromosome 3 are being generated in process: 2403889 Counts for chromosome 8 are being generated in process: 2403889 Counts for chromosome 11 are being generated in process: 2403887 Counts for chromosome 19 are being generated in process: 2403887 Counts for chromosome 7 are being generated in process: 2403887 Counts for chromosome 12 are being generated in process: 2403888 Counts for chromosome 18 are being generated in process: 2403888 Counts for chromosome 5 are being generated in process: 2403888 Counts for chromosome X are being generated in process: 2403888 Counts for chromosome 1 are being generated in process: 2403885 Counts for chromosome 2 are being generated in process: 2403885 Counts for chromosome 9 are being generated in process: 2403885 Counts for chromosome 10 are being generated in process: 2403886 Counts for chromosome 17 are being generated in process: 2403886 Counts for chromosome 6 are being generated in process: 2403886 Counts for chromosome MT are being generated in process: 2403886 Traceback (most recent call last): File "/home/colpe/data/conda/envs/transposon_env/lib/python3.10/site-packages/pandas/core/indexes/range.py", line 414, in get_loc return self._range.index(new_key) ValueError: 4 is not in range

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/scratch/colpe/transposons/scripts/SoloTE_pipeline.py", line 217, in tecounts2.loc[tecounts2[4].isnull(),4] = tecounts2.loc[tecounts2[4].isnull(),1] File "/home/colpe/data/conda/envs/transposon_env/lib/python3.10/site-packages/pandas/core/frame.py", line 3893, in getitem indexer = self.columns.get_loc(key) File "/home/colpe/data/conda/envs/transposon_env/lib/python3.10/site-packages/pandas/core/indexes/range.py", line 416, in get_loc raise KeyError(key) from err KeyError: 4 Could you advise please?

bvaldebenitom commented 7 months ago

Can you share the output of ls -lht /home/colpe/scratch/transposons/SoloTE_output/plate3_SoloTE__SoloTE_temp ?

coraolpe commented 7 months ago

Hi, yes, sure, it's this: total 373M -rw-rw-r-- 1 colpe colpe 22M Nov 28 16:00 plate3_SoloTEallcounts.txt -rw-rw-r-- 1 colpe colpe 69K Nov 28 16:00 plate3_SoloTEcountpercell_MT.counts -rw-rw-r-- 1 colpe colpe 1.3M Nov 28 16:00 plate3_SoloTEcountpercell_9.counts -rw-rw-r-- 1 colpe colpe 680K Nov 28 16:00 plate3_SoloTEcountpercell_X.counts -rw-rw-r-- 1 colpe colpe 1.6M Nov 28 16:00 plate3_SoloTEcountpercell_7.counts -rw-rw-r-- 1 colpe colpe 1.2M Nov 28 16:00 plate3_SoloTEcountpercell_8.counts -rw-rw-r-- 1 colpe colpe 78 Nov 28 16:00 plate3_SoloTEcountpercell_JH584295.1.counts -rw-rw-r-- 1 colpe colpe 40 Nov 28 16:00 plate3_SoloTEcountpercell_JH584292.1.counts -rw-rw-r-- 1 colpe colpe 650 Nov 28 16:00 plate3_SoloTEcountpercell_GL456216.1.counts -rw-rw-r-- 1 colpe colpe 3.8K Nov 28 16:00 plate3_SoloTEcountpercell_JH584304.1.counts -rw-rw-r-- 1 colpe colpe 38 Nov 28 16:00 plate3_SoloTEcountpercell_JH584294.1.counts -rw-rw-r-- 1 colpe colpe 1.2K Nov 28 16:00 plate3_SoloTEcountpercell_GL456233.1.counts -rw-rw-r-- 1 colpe colpe 4.6K Nov 28 16:00 plate3_SoloTEcountpercell_Y.counts -rw-rw-r-- 1 colpe colpe 1.3M Nov 28 16:00 plate3_SoloTEcountpercell_4.counts -rw-rw-r-- 1 colpe colpe 1.6M Nov 28 16:00 plate3_SoloTEcountpercell_5.counts -rw-rw-r-- 1 colpe colpe 1.1M Nov 28 16:00 plate3_SoloTEcountpercell_6.counts -rw-rw-r-- 1 colpe colpe 1.7M Nov 28 16:00 plate3_SoloTEcountpercell_2.counts -rw-rw-r-- 1 colpe colpe 1.1M Nov 28 16:00 plate3_SoloTEcountpercell_3.counts -rw-rw-r-- 1 colpe colpe 697K Nov 28 16:00 plate3_SoloTEcountpercell_19.counts -rw-rw-r-- 1 colpe colpe 996K Nov 28 16:00 plate3_SoloTEcountpercell_17.counts -rw-rw-r-- 1 colpe colpe 665K Nov 28 16:00 plate3_SoloTEcountpercell_18.counts -rw-rw-r-- 1 colpe colpe 885K Nov 28 16:00 plate3_SoloTEcountpercell_15.counts -rw-rw-r-- 1 colpe colpe 723K Nov 28 16:00 plate3_SoloTEcountpercell_16.counts -rw-rw-r-- 1 colpe colpe 1.5M Nov 28 16:00 plate3_SoloTEcountpercell_1.counts -rw-rw-r-- 1 colpe colpe 1.7M Nov 28 16:00 plate3_SoloTEcountpercell_11.counts -rw-rw-r-- 1 colpe colpe 853K Nov 28 16:00 plate3_SoloTEcountpercell_12.counts -rw-rw-r-- 1 colpe colpe 1.1M Nov 28 16:00 plate3_SoloTEcountpercell_10.counts -rw-rw-r-- 1 colpe colpe 828K Nov 28 16:00 plate3_SoloTEcountpercell_13.counts -rw-rw-r-- 1 colpe colpe 772K Nov 28 16:00 plate3_SoloTEcountpercell_14.counts -rw-rw-r-- 1 colpe colpe 1.6M Nov 28 16:00 plate3_SoloTE__final.bam.bai -rw-rw-r-- 1 colpe colpe 328M Nov 28 16:00 plate3_SoloTEfinal.bam -rw-rw-r-- 1 colpe colpe 1.9K Nov 28 15:58 plate3_SoloTEteannotated.bam -rw-rw-r-- 1 colpe colpe 1.8K Nov 28 15:58 temp_annotated_te.bam -rw-rw-r-- 1 colpe colpe 0 Nov 28 15:57 plate3_SoloTE__selectedtes.bed -rw-rw-r-- 1 colpe colpe 0 Nov 28 15:57 plate3_SoloTEnogenes_overlappingtes.bed -rw-rw-r-- 1 colpe colpe 544 Nov 28 15:57 plate3_SoloTEnogenes_overlappingtes.bam.bai -rw-rw-r-- 1 colpe colpe 1.8K Nov 28 15:57 plate3_SoloTEnogenes_overlappingtes.bam

coraolpe commented 7 months ago

Hi again, I would be most grateful if you could advise on my issue please. Best wishes Cora

bvaldebenitom commented 7 months ago

Thanks for kindly following up!

Based on the above files, it looks like plate3_SoloTE__selectedtes.bed and plate3_SoloTE__nogenes_overlappingtes.bed are empty. Please share back the output of the following commands to further debug this issue:

samtools view /home/colpe/scratch/transposons/bam_files/plate_v003_Aligned.sortedByCoord.out.bam|head head /home/colpe/scratch/transposons/references/mm10_rmsk.bed

coraolpe commented 7 months ago

Hi thank you! Output from the bam file: VH01503:16:AAC3W5CHV:1:1101:48036:5809 16 1 3000469 255 59M1S 0 0 TTGTTTCCACTTGGTTGATTTCAGCTCTGAGTTTGATTATTTCCTGCTGTCTACTCATCC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 nM:i:0 AS:i:58 CR:Z:TTTTCTCC UR:Z:CGGGAG GX:Z:- GN:Z:- sS:Z:CGGGAGTTTTCTCCTTTTTCATTTCT sQ:Z:CCCCCCCCCCCCCCCCCCCCCCCCCC sM:i:1 CB:Z:TTTTCGCC UB:Z:CGGGAG VH01503:16:AAC3W5CHV:2:1508:38549:9084 16 1 3000469 255 59M1S 0 0 TTGTTTCCACTTGGTTGATTTCAGCTCTGAGTTTGATTATTTCCTGCTGTCTACTCATCC CCCCCCCC;CCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 HI:i:1 nM:i:0 AS:i:58 CR:Z:TTTTCTCC UR:Z:CGGGAG GX:Z:- GN:Z:- sS:Z:CGGGAGTTTTCTCCTTTTTCATTTCT sQ:Z:CCCCCCCCCCCCCC-CCCCCCCCCC; sM:i:1 CB:Z:TTTTCGCC UB:Z:CGGGAG VH01503:16:AAC3W5CHV:1:2503:27907:35702 16 1 3001688 0 56M4S 0 0 CATGGAATACTTTGGTTTCTCCATCTATGGTAATTGAGAGTTTGGCTGGGTATAGCCCCC CCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC- NH:i:6 HI:i:1 nM:i:0 AS:i:55 CR:Z:CGTATCTG UR:Z:TCGATT GX:Z:- GN:Z:- sS:Z:TCGATTCGTATCTGTTAGTCTATGTC sQ:Z:CCCCCCCCCCCCCCCCCCCCCCCCCC sM:i:1 CB:Z:CGTATCAG UB:Z:TCGATT VH01503:16:AAC3W5CHV:2:1111:72368:15369 16 1 3004243 3 20M164267N40M 0 0 CAACAAAAAAACACACAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA CC------------C-C;-;-C---CCC;-CCC-CCCC;CCCCCCCCCCCCCCCCCCCCC NH:i:2 HI:i:1 nM:i:0 AS:i:48 CR:Z:AATGGTGA UR:Z:GTTGGT GX:Z:- GN:Z:- sS:Z:GTTGGTAATGGTGAAAAAAAAAAAAsQ:Z:CCCCCCCCCCCCC;CCCCCCCCCCCC sM:i:1 CB:Z:AATGGTGG UB:Z:GTTGGT VH01503:16:AAC3W5CHV:2:1213:70342:36213 16 1 3004247 3 24M71156N36M 0 0 AAAAAAACACACAAAAAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAGA CC-CC-C---C-C-;;;--C--C--;-C-;--;-CC-CC;CCCCCCCCC;CCCCCCCCCC NH:i:2 HI:i:1 nM:i:6 AS:i:44 CR:Z:AAAAAAAA UR:Z:AAAAAA GX:Z:- GN:Z:- sS:Z:AAAAAAAAAAAAAAAAAAAAAAAAAsQ:Z:CCCCCCCCCCCCCCCCCCCCCCCCCC sM:i:-24 CB:Z:- UB:Z:- VH01503:16:AAC3W5CHV:2:1405:71345:27940 0 1 3004528 255 4S56M 0 0 TTTTTGCGAAAAACCTATCAGAAAAATGGCTGTTTTCTTAATGGAATCTTTAAAAACTAG CC-CCCCC;CC-CCCCCCCCC;CCCCC;CCCC-CCCCC;CCCCCCCC;;CCCC-C-C;C- NH:i:1 HI:i:1 nM:i:0 AS:i:55 CR:Z:TGATATTT UR:Z:TGTTGG GX:Z:- GN:Z:- sS:Z:TGTTGGTGATATTTGGTGGTGTTGTT sQ:Z:C-CC--CC;CCCCCCCCCCCCCC;CC sM:i:-1 CB:Z:- UB:Z:- VH01503:16:AAC3W5CHV:2:1102:18705:15483 0 1 3004529 255 1S59M 0 0 GGCGAAAAACCTATCAGAAAAATGGCTGTTTTCTTAATGGAATCTTTAAAAACTAGAGGA CCCC;-CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;C;-CCCCCCCC NH:i:1 HI:i:1 nM:i:0 AS:i:58 CR:Z:TGATATTT UR:Z:TGTTGG GX:Z:- GN:Z:- sS:Z:TGTTGGTGATATTTGGTGGTGTTGTT sQ:Z:CCC;;CC;CCCCCC;CCCCCCCCCCC sM:i:-1 CB:Z:- UB:Z:- VH01503:16:AAC3W5CHV:2:1102:22435:26028 0 1 3004529 255 1S59M 0 0 GGCGAAAAACCTATCAGAAAAATGGCTGTTTTCTTAATGGAATCTTTAAAAACTAGAGGA CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC-CCCCCCCC NH:i:1 HI:i:1 nM:i:0 AS:i:58 CR:Z:TGATATTT UR:Z:TGTTGG GX:Z:- GN:Z:- sS:Z:TGTTGGTGATATTTGGTGGTGTTGTT sQ:Z:CCCCCCCCCCCCCCCCCCCCCCCCCC sM:i:-1 CB:Z:- UB:Z:- VH01503:16:AAC3W5CHV:2:1102:22549:26672 0 1 3004529 255 1S59M 0 0 GGCGAAAAACCTATCAGAAAAATGGCTGTTTTCTTAATGGAATCTTTAAAAACTAGAGGA CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCC;CCCCCCCC NH:i:1 HI:i:1 nM:i:0 AS:i:58 CR:Z:TGATATTT UR:Z:TGTTGG GX:Z:- GN:Z:- sS:Z:TGTTGGTGATATTTGGTGGTGTTGTT sQ:Z:CCCCCCCCCCCCCCCCCCCCCCCCCC sM:i:-1 CB:Z:- UB:Z:- VH01503:16:AAC3W5CHV:2:2210:68013:8175 0 1 3004529 255 1S59M 0 0 GGCGAAAAACCTATCAGAAAAATGGCTGTTTTCTTAATGGAATCTTTAAAAACTAGAGGA CCCCCCCCCCCCC;CCC-CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC-CCCCCCCCCCC NH:i:1 HI:i:1 nM:i:0 AS:i:58 CR:Z:TGATATTT UR:Z:TGTTGG GX:Z:- GN:Z:- sS:Z:TGTTGGTGATATTTGGTGGTGTTGTT sQ:Z:CCCCC;CCCCCCCCCCCCCCCCCCCC sM:i:-1 CB:Z:- UB:Z:- Output from the bed file: chr1 3000001 3002128 chr1|3000001|3002128|L1_Mus3:L1:LINE|10.5|- 10.5 - chr1 3003153 3003994 chr1|3003153|3003994|L1Md_F:L1:LINE|26.8|- 26.8 - chr1 3003994 3004054 chr1|3003994|3004054|L1_Mus3:L1:LINE|27.9|- 27.9 - chr1 3004041 3004206 chr1|3004041|3004206|L1_Rod:L1:LINE|19.9|+ 19.9 + chr1 3004271 3005001 chr1|3004271|3005001|L1_Rod:L1:LINE|19.9|+ 19.9 + chr1 3005002 3005439 chr1|3005002|3005439|L1_Rod:L1:LINE|22.1|+ 22.1 + chr1 3005461 3005548 chr1|3005461|3005548|Lx9:L1:LINE|22.6|+ 22.6 + chr1 3005571 3006764 chr1|3005571|3006764|Lx9:L1:LINE|22.6|+ 22.6 + chr1 3007015 3007268 chr1|3007015|3007268|L1M4:L1:LINE|28.9|- 28.9 - chr1 3008117 3008483 chr1|3008117|3008483|L1_Mur2:L1:LINE|14.8|- 14.8 -

coraolpe commented 7 months ago

Hi, I am really stuck on this and would really be grateful for some input please. I just ran a bunch of bits by themselves and noticed it gets stuck at this place: allcounts = pd.read_table(allcountsfile,header=None) tecounts = allcounts[allcounts[0].str.contains("SoloTE")] tecounts2 = tecounts[0].str.split("|",expand=True) tecounts2.loc[tecounts2[4].isnull(),4] = tecounts2.loc[tecounts2[4].isnull(),1]

because there is no "SoloTE" annotation anywhere. I am wondering if this has to do with chromosome numbering? My bam file uses 1,2,3 etc whereas the bed file has chr1, chr2 etc... what do you think? any tipps for easy conversion? I'm trying samtools reheater but no luck, it throws an error

coraolpe commented 7 months ago

Ok so I went ahead and I removed the 'chr' from the repeatmasker bed file and it ran without the error. However, I am surprised at the high % of TE UMIs and low % (0!?) of locus-specific TEs. I get this: A total of 8474452 UMIs are in the final matrix. Of these, 3331436 (39.312%) correspond to genes. and 5143016 (60.688%) correspond to TEs. TE detected UMIs are distributed as follows: Locus-specific TEs: 0 UMIs (0.0%). Subfamily TEs: 5143016 (100.0%).

I wonder if something dodgy is going on? Also. Could you please expand on which matrix to use for Seurat? I got many of them: drwxrwxr-x 2 colpe colpe 3 Dec 4 14:47 plate3_SoloTE_classtes_MATRIX drwxrwxr-x 2 colpe colpe 3 Dec 4 14:47 plate3_SoloTE_familytes_MATRIX drwxrwxr-x 2 colpe colpe 3 Dec 4 14:47 plate3_SoloTE_legacytes_MATRIX drwxrwxr-x 2 colpe colpe 3 Dec 4 14:47 plate3_SoloTE_locustes_MATRIX drwxrwxr-x 2 colpe colpe 3 Dec 4 14:47 plate3_SoloTE_subfamilytes_MATRIX and the documentation does not say at all which ones to use for downstream analysis. Thanks a lot!

bvaldebenitom commented 7 months ago

Hi @coraolpe !

I appreciate you being so proactive with the issue. This is similar to issue #2, and the recommended solution is to do: sed 's/^chr//' BEDfile > NEWBEDfile This will only remove the chr at the beginning of the TE BED file, but not the one on column 4, which is used for counting locus-specific TEs.

Afterwards, you can check that column 1 on the BED file doesn't have the "chr", and that it still appears on column 4. For example, it should look like this: 1 3000001 3002128 chr1|3000001|3002128|L1_Mus3:L1:LINE|10.5|- 10.5 -

Once this is done, delete all the files, run the pipeline again and you should see locus-specific results.

Regarding which matrix to use, please check these posts: https://github.com/bvaldebenitom/SoloTE/issues/30#issuecomment-1721876808 https://github.com/bvaldebenitom/SoloTE/issues/31#issuecomment-1724094927

I have to push some updates soon, and it is in my to-do list to add these to the documentation. Let me know if those are clear enough!

coraolpe commented 6 months ago

Hi! Sorry for taking a while to reply. So this has solved the issue and it now ran super well and I'm very excitedly analysing the output. Can I just ask something totally basic - for the SoloTE-DNA category - is it the Transposase transcripts that are being counted? Because I guess it's RNAseq data so we aren't measuring any DNA that is moving around, right?

bvaldebenitom commented 5 months ago

Hi @coraolpe,

sorry for the late reply. Thanks for the comment, and glad that it is working for you now!

Yes, since we are using RNA-Seq, we are only measuring transcripts. This is true for all TEs, not only the DNA category. However, if you are interested on potential changes at the DNA level, it can still be somewhat speculated. A strategy I recommend is that once you find a set of TEs from the DNA category with locus resolution, take their sequence and do a DNA-to-protein translation. If you find no stop codons, it may be indicative of transcripts that could eventually lead to the production of the transposase proteins, which in turn might be moving / transposing along the genome.

This will not be hard evidence for actual transposition activity, but it will allow for some speculations / predictions about it.

Hope this helps!