bvaldebenitom / SoloTE

GNU General Public License v3.0

Run fails after 1 hour #2

Closed JBreunig closed 1 year ago

JBreunig commented 1 year ago

I tried running the commands and it failed after about an hour. Might you have any suggestions on troubleshooting from the output below? (Samtools 1.16.1; Bedtools v2.30.0; R 4.2.1)

python /mnt/Sabrent2TBRefsCR/SoloTE/SoloTE_v1/SoloTE_pipeline.py /mnt/12TBNew0821/AnatDecabitine/SS-15340--01--14--2022/FASTQ/star_out/PBS2/Aligned.sortedByCoord.out.bam 48 Tester /mnt/Sabrent2TBRefsCR/SoloTE/Mm10TEannotation.bed
SoloTE started at 19:50:35
samtools found!
bedtools found!
['@CO\tuser command line: /mnt/Sabrent2TBRefsCR/STARlatest0822/STAR ', 'quantMode GeneCounts ', 'soloType CB_UMI_Simple ', 'soloCBwhitelist /mnt/Sabrent2TBRefsCR/STAR/3M-february-2018.txt ', 'soloCBlen 16 ', 'soloUMIstart 17 ', 'soloUMIlen 12 ', 'soloBarcodeReadLength 1 ', 'soloMultiMappers EM ', 'soloFeatures Gene Velocyto ', 'soloUMIfiltering MultiGeneUMI ', 'soloCBmatchWLtype 1MM_multi_pseudocounts ', 'outSAMtype BAM SortedByCoordinate ', 'outSAMattributes NH HI CR UR CB UB GX GN ', 'outSAMmultNmax 1 ', 'runThreadN 32 ', 'genomeDir /mnt/Sabrent2TBRefsCR/WorkingMouseRefKBtools01012020/WPRE_V5Gestalt_static_DoxMinTm_0621 ', 'sjdbGTFfile /mnt/Sabrent2TBRefsCR/WorkingMouseRefKBtools01012020/WPRE_V5Gestalt_static_DoxMinTm_0621/tmp.gtf ', 'readFilesCommand zcat ', 'readFilesPrefix /mnt/12TBNew0821/AnatDecabitine/SS-15340', '01', '14', '2022/FASTQ/ ', 'readFilesIn PBS-CTRL-GEX_S2_L001_R2_001.fastq.gz,PBS-CTRL-GEX_S2_L002_R2_001.fastq.gz PBS-CTRL-GEX_S2_L001_R1_001.fastq.gz,PBS-CTRL-GEX_S2_L002_R1_001.fastq.gz ', 'outTmpDir=/mnt/Sabrent4TBMsTum/AY_7319_Dox_Suc_Etv5_07_19/STARtmp ', 'outFileNamePrefix star_out/PBS2/']
outSAMattributes NH HI CR UR CB UB GX GN
CB and UB tags present in BAM file
samtools view --threads 48 -d GN -U Tester_nogenes.bam -O BAM -o Tester_genes.bam /mnt/12TBNew0821/AnatDecabitine/SS-15340--01--14--2022/FASTQ/star_out/PBS2/Aligned.sortedByCoord.out.bam
samtools view --threads 48 -O BAM -o Tester_nogenes_overlappingtes.bam -L /mnt/Sabrent2TBRefsCR/SoloTE/Mm10TEannotation.bed Tester_nogenes.bam
samtools index Tester_nogenes_overlappingtes.bam
bedtools bamtobed -i Tester_nogenes_overlappingtes.bam -split > Tester_nogenes_overlappingtes.bed
bedtools intersect -a /mnt/Sabrent2TBRefsCR/SoloTE/Mm10TEannotation.bed -b Tester_nogenes_overlappingtes.bed -u > Tester_selectedtes.bed
python /mnt/Sabrent2TBRefsCR/SoloTE/SoloTE_v1/annotateBAM.py Tester_nogenes_overlappingtes.bam Tester_selectedtes.bed Tester_teannotated.bam
samtools cat --threads 48 -o Tester_full.bam Tester_genes.bam Tester_teannotated.bam
samtools sort --threads 48 -O BAM -o Tester_full_sorted.bam Tester_full.bam
[bam_sort_core] merging from 144 files and 48 in-memory blocks...
samtools view -U Tester_final.bam --tag CB:- -@ 8 Tester_full_sorted.bam > /dev/null
samtools index Tester_final.bam
[]
grep: Tester_allcounts.txt: No such file or directory
grep: Tester_allcounts.txt: No such file or directory
grep: Tester_allcounts.txt: No such file or directory


***** ERROR: Requested column 3, but database file - only has fields 1 - 0.


***** ERROR: Requested column 3, but database file - only has fields 1 - 0.


***** ERROR: Requested column 3, but database file - only has fields 1 - 0.
Error in read.table(file = file, header = header, sep = sep, quote = quote, : no lines available in input
Calls: read.delim -> read.table
Execution halted
Creating final results directory /mnt/12TBNew0821/AnatDecabitine/SS-15340--01--14--2022/FASTQ/star_out/PBS2/Tester_SoloTE_output was created
Traceback (most recent call last):
  File "/mnt/Sabrent2TBRefsCR/SoloTE/SoloTE_v1/SoloTE_pipeline.py", line 238, in <module>
    file_table = pd.read_table(input_te_file,header=None,sep="\t")
  File "/home/jjbtr32643970xadmin/anaconda3/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/jjbtr32643970xadmin/anaconda3/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 779, in read_table
    return _read(filepath_or_buffer, kwds)
  File "/home/jjbtr32643970xadmin/anaconda3/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/jjbtr32643970xadmin/anaconda3/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 934, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/jjbtr32643970xadmin/anaconda3/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1236, in _make_engine
    return mapping[engine](f, **self.options)
  File "/home/jjbtr32643970xadmin/anaconda3/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 551, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

bvaldebenitom commented 1 year ago

Hi @JBreunig,

Can you share the command you used, the first lines of your TE BED file (using head), and the output of ls -Rlht on the results folder?

JBreunig commented 1 year ago

Here you go...thanks again:

$ head Mm10TEannotation.bed
chr1 3000001 3002128 chr1|3000001|3002128|L1_Mus3:L1:LINE|- 12955 -
chr1 3003153 3003994 chr1|3003153|3003994|L1Md_F:L1:LINE|- 1216 -
chr1 3003994 3004054 chr1|3003994|3004054|L1_Mus3:L1:LINE|- 234 -
chr1 3004041 3004206 chr1|3004041|3004206|L1_Rod:L1:LINE|+ 3685 +
chr1 3004271 3005001 chr1|3004271|3005001|L1_Rod:L1:LINE|+ 3685 +
chr1 3005002 3005439 chr1|3005002|3005439|L1_Rod:L1:LINE|+ 1280 +
chr1 3005461 3005548 chr1|3005461|3005548|Lx9:L1:LINE|+ 4853 +
chr1 3005571 3006764 chr1|3005571|3006764|Lx9:L1:LINE|+ 4853 +
chr1 3007015 3007268 chr1|3007015|3007268|L1M4:L1:LINE|- 438 -
chr1 3008117 3008483 chr1|3008117|3008483|L1_Mur2:L1:LINE|- 1590 -

$ ls -Rlht
.:
total 61G
-rw-rw-r-- 1 jjbtr32643970xadmin jjbtr32643970xadmin 0 Oct 11 20:45 Tester_allcounts_final.txt
-rw-rw-r-- 1 jjbtr32643970xadmin jjbtr32643970xadmin 0 Oct 11 20:45 Tester_subftes_2.txt
-rw-rw-r-- 1 jjbtr32643970xadmin jjbtr32643970xadmin 0 Oct 11 20:45 Tester_genes_2.txt
-rw-rw-r-- 1 jjbtr32643970xadmin jjbtr32643970xadmin 0 Oct 11 20:45 Tester_locustes_2.txt
-rw-rw-r-- 1 jjbtr32643970xadmin jjbtr32643970xadmin 0 Oct 11 20:45 Tester_genes.txt
-rw-rw-r-- 1 jjbtr32643970xadmin jjbtr32643970xadmin 0 Oct 11 20:45 Tester_locustes.txt
-rw-rw-r-- 1 jjbtr32643970xadmin jjbtr32643970xadmin 0 Oct 11 20:45 Tester_subftes.txt
-rw-rw-r-- 1 jjbtr32643970xadmin jjbtr32643970xadmin 5.7M Oct 11 20:45 Tester_final.bam.bai
-rw-rw-r-- 1 jjbtr32643970xadmin jjbtr32643970xadmin 16G Oct 11 20:41 Tester_final.bam
-rw-rw-r-- 1 jjbtr32643970xadmin jjbtr32643970xadmin 16G Oct 11 19:59 Tester_full_sorted.bam
-rw-rw-r-- 1 jjbtr32643970xadmin jjbtr32643970xadmin 16G Oct 11 19:54 Tester_full.bam
-rw-rw-r-- 1 jjbtr32643970xadmin jjbtr32643970xadmin 2.1K Oct 11 19:53 Tester_teannotated.bam
-rw-rw-r-- 1 jjbtr32643970xadmin jjbtr32643970xadmin 0 Oct 11 19:53 Tester_selectedtes.bed
-rw-rw-r-- 1 jjbtr32643970xadmin jjbtr32643970xadmin 592 Oct 11 19:53 Tester_nogenes_overlappingtes.bam.bai
-rw-rw-r-- 1 jjbtr32643970xadmin jjbtr32643970xadmin 0 Oct 11 19:53 Tester_nogenes_overlappingtes.bed
-rw-rw-r-- 1 jjbtr32643970xadmin jjbtr32643970xadmin 2.1K Oct 11 19:53 Tester_nogenes_overlappingtes.bam

bvaldebenitom commented 1 year ago

Can you share the first lines of the output of samtools view on your BAM file?

JBreunig commented 1 year ago

Here you go:

samtools view Aligned.sortedByCoord.out.bam | head
A00319:434:H27L5DRX2:1:2205:32190:11898 16 1 3001665 255 90M * 0 0 ATATTGTGTGAATTTTGTTTGGTCGTGGAATACTTTGGTTTCTCCATCTATGGTAATTGAGAGTTTGGCCGGGTATAGTAGCCTGGGCTG F,,,,F,F,F:,F,,,F::,,F,FF,FF::,FF:,,FF,:FFFFFF:F,:,FF:F,,FFFF:F:,,FFFFFFF,F,FF::FFF,FFFF:F NH:i:1 HI:i:1 CR:Z:TGGTGATTCTTGAGCA UR:Z:TTTACATTTCCG GX:Z:- GN:Z:- CB:Z:TGGTGATTCTTGAGCA UB:Z:TTTACATTTCCG
A00319:434:H27L5DRX2:2:2236:3215:27978 16 1 3015473 255 90M * 0 0 GTTACTTCACTCAGGATGATACCCTCCAGGTCCATCCATTTGCCTAGGAATCTCATAAATTCATTTTTTAATAGCTGAGTAGTATTCCAT FFFFFFFF:FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 CR:Z:TCTAACTTCTGTCGCT UR:Z:ATATAAGAATCC GX:Z:- GN:Z:- CB:Z:TCTAACTTCTGTCGCT UB:Z:ATATAAGAATCC
A00319:434:H27L5DRX2:1:2258:30572:15452 16 1 3016298 0 90M * 0 0 ACTTTCTCCTCTGTAAGTTTCAGTGTCTCTGGTTTTATGTGGAGTTCCTTAATCCACTTAGATTTGACCTTAGTACAAGGAGATAGGAAT F::FFFFFFFFFFFF:FFF:FFFFFFFFF:FFFFF:,FFFFFFF,FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:9 HI:i:1 CR:Z:TGGGCTGCAGCTACAT UR:Z:CCTCTGAATCCT GX:Z:- GN:Z:- CB:Z:TGGGCTGCAGCTACAT UB:Z:CCTCTGAATCCT
A00319:434:H27L5DRX2:2:2144:5665:32565 16 1 3018672 1 90M * 0 0 TTTTGTTTTAGGATAAAATGTTCTGTAGATATCTGTCAAGTCCATTTGTTTCATCACTTCTGTTAGTTTCACTGTGTCCCTGTTTAGTTT ::FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:3 HI:i:1 CR:Z:ACTTTGTAGAAGGTAG UR:Z:GAGCATGGCTAT GX:Z:- GN:Z:- CB:Z:ACTTTGTAGAAGGTAG UB:Z:GAGCATGGCTAT
A00319:434:H27L5DRX2:2:2112:6506:30138 16 1 3018677 1 90M * 0 0 TTTTAGGATAAAATGTTCTGTAGATATCTGTCAAGTCCATTTGTTTCATCACTTCTGTTAGTTTCACTGTGTCCCTGTTTAGTTTCTGTT FFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFF,F:FFFF:FFFFFFFFFFFFFFFF:FF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFF NH:i:3 HI:i:1 CR:Z:CATACTTCAACCGATT UR:Z:CTTAAGTTTCTA GX:Z:- GN:Z:- CB:Z:CATACTTCAACCGATT UB:Z:CTTAAGTTTCTA
A00319:434:H27L5DRX2:2:2102:32380:16297 16 1 3019419 0 90M * 0 0 TACTTTGGTTTCTCCATCTATGGTTATTGAGAGGTTGGCTGGGTATAGTAGCCTGGGCTGGAATTTGTGTTCTCTTAGTGTCTGTATAAC ,FFF:,:::,:F,FFF:F:F,:FF,FFFFFFFF,F:::FF:F::F,:F,,FF:::::F:F:FFF,F:FFFFFFFF,:F,F:F,:F:FF,F NH:i:8 HI:i:1 CR:Z:CTCTGGTCATTAAGCC UR:Z:GAATCATTGGCA GX:Z:- GN:Z:- CB:Z:CTCTGGTCATTAAGCC UB:Z:GAATCATTGGCA
A00319:434:H27L5DRX2:2:2153:18530:36401 16 1 3020095 3 90M * 0 0 TTTGTTCATTTCCATCACCTGTTTGGATGTGTTTTCCTGTTTTTCTATACGGACTTCTACCTGTTTGGTTGTGTTTTCCTGTTTTTCTTT FFFFFFFFFFFF::F:FFFFF,F,,FFFFFFFF,FFFFFFFFFFF,FFF,FFFFFFF:FFF:F:F,FFFFFFF:FFFFFFFFF:FFFFFF NH:i:2 HI:i:1 CR:Z:GCAGCCACACTACACA UR:Z:CGTTGGGATCAT GX:Z:- GN:Z:- CB:Z:GCAGCCACACTACACA UB:Z:CGTTGGGATCAT
A00319:434:H27L5DRX2:2:2102:32081:3129 0 1 3025098 255 88M2S * 0 0 ACCCTCCAGTGGAAAAAAGACAGCATTGTCAACAAAGGGTGTGGGCACAACTGGTGGTTATCATCATGAAGAATGCAAATTGATCCATTC :,F,,FF,F,,:FF,:F:FFF::::::,FF,FF,FF,F,,F:,::F,F,FF,:F,F,,,,,F,F,,,FFFF,::,F:,F,F,FFFF,,FF NH:i:1 HI:i:1 CR:Z:TGACCACAGGCATGGT UR:Z:CCGATTTAATGG GX:Z:- GN:Z:- CB:Z:TGTCCACAGGCATGGT UB:Z:CCGATTTAATGG
A00319:434:H27L5DRX2:2:2119:26133:23171 0 1 3038147 255 90M * 0 0 AGATAACTGTGCACCTCCCTGAAAGAGGAGAGCTTGCCTGCAGAGACTGCTCTGACCCCTGAAACTCAGGGAAGAGAGCTAGTCTCCCTG FF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFF,FFF:FFFFFFFFFFFFFFFF:FFFFFF:FFFFF,:,FF:F,FF:FF:,F:F NH:i:1 HI:i:1 CR:Z:TGCTGAACACCAGCGT UR:Z:GCATTGTTCTCC GX:Z:- GN:Z:- CB:Z:TGCTGAACACCAGCGT UB:Z:GCATTGTTCTCC
A00319:434:H27L5DRX2:2:2236:1488:5791 0 1 3038346 255 24S66M * 0 0 GTTCATTTCAGCTTTTCACACCTCTGTACTAACAGGAACCAAGACCACTCACCATCACCAGAACCCAGCACACCCACTTCGCCCAGTCCA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 CR:Z:CTTACCGAGAGCAACC UR:Z:AAATTTTCTATA GX:Z:- GN:Z:- CB:Z:CTTACCGAGAGCAACC UB:Z:AAATTTTCTATA

bvaldebenitom commented 1 year ago

OK, I see the issue: the chromosome names in your BAM file don't match those in the BED file. For example, the BED file has "chr1" whereas the BAM has "1".

First, delete all files from the SoloTE output directory you were using, and then use sed 's/^chr//' BEDfile > NEWBEDfile to create a BED file that matches the chromosome annotation of the alignment. Afterwards, run SoloTE with this new BED file.
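
The same renaming, plus a quick check that the two naming schemes actually overlap, could be sketched in Python as below. This is only an illustration: `strip_chr_prefix` and `shared_contigs` are hypothetical helper names that are not part of SoloTE, and the contig list is assumed to have been read from the BAM header (for example with samtools view -H).

```python
def strip_chr_prefix(bed_line: str) -> str:
    """Remove a leading 'chr' from the first (chromosome) column only."""
    fields = bed_line.rstrip("\n").split("\t")
    if fields[0].startswith("chr"):
        fields[0] = fields[0][len("chr"):]
    return "\t".join(fields)


def shared_contigs(bed_lines, bam_contigs):
    """Return the contig names found in both the BED lines and the BAM header."""
    bed_contigs = {line.split("\t", 1)[0] for line in bed_lines if line.strip()}
    return bed_contigs & set(bam_contigs)


# One BED row from this thread (tab-separated) and an assumed, shortened contig list.
bed = ["chr1\t3000001\t3002128\tchr1|3000001|3002128|L1_Mus3:L1:LINE|-\t12955\t-"]
bam_contigs = ["1", "2", "MT"]

print(shared_contigs(bed, bam_contigs))    # no overlap while the BED still uses "chr1"
fixed = [strip_chr_prefix(line) for line in bed]
print(shared_contigs(fixed, bam_contigs))  # overlap once the prefix is stripped
```

Like the sed command, this only touches the first column; the "chr1|..." identifier in the fourth column is left as it is. An empty overlap before running the pipeline would be consistent with the zero-byte intermediate BED files listed earlier in the thread.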

JBreunig commented 1 year ago

That makes sense, but the same error occurred even after deleting the output directories and pointing to the new BED file. Could it be the second 'chr' in the BED file?
1 3000001 3002128 chr1|3000001|3002128|L1_Mus3:L1:LINE|- 12955 -

Shall I try: sed 's/^chr//; s/chr//' Mm10TEannotation.bed > Mm10TEannotationV3.bed

Edit: that didn't fix it either.

bvaldebenitom commented 1 year ago

Looking at the BAM lines you shared, there is an additional issue: reads not associated with any gene carry the GN:Z:- tag instead of lacking the GN tag altogether (the latter is what we saw in the BAM files used during the development of SoloTE).

This is similar to issue #3, and we are working on a fix. Can you share your alignment protocol? It would help us broaden the usability and compatibility of our tool. In #3, there was a mention of using the Cumulus STARsolo pipeline.
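
To illustrate the tag difference described above, here is a minimal Python sketch that treats both a missing GN tag and the placeholder GN:Z:- as "no gene". It is not the fix that went into SoloTE; `gene_name` is a hypothetical helper and the three SAM records are toy data made up for the example.

```python
def gene_name(sam_line: str):
    """Return the GN tag value of a SAM record, or None when the read has no gene.

    Both a missing GN tag and the placeholder 'GN:Z:-' count as 'no gene'.
    """
    # SAM has 11 mandatory columns; optional tags start at index 11.
    optional_fields = sam_line.rstrip("\n").split("\t")[11:]
    for field in optional_fields:
        if field.startswith("GN:Z:"):
            value = field[len("GN:Z:"):]
            return None if value == "-" else value
    return None


# Toy records (not taken from the BAM in this thread).
with_placeholder = "r1\t16\t1\t3001665\t255\t4M\t*\t0\t0\tACGT\tFFFF\tNH:i:1\tGN:Z:-"
without_tag = "r2\t16\t1\t3001665\t255\t4M\t*\t0\t0\tACGT\tFFFF\tNH:i:1"
with_gene = "r3\t0\t1\t3001665\t255\t4M\t*\t0\t0\tACGT\tFFFF\tNH:i:1\tGN:Z:GeneA"

for record in (with_placeholder, without_tag, with_gene):
    print(record.split("\t")[0], gene_name(record))
```

A filter that only tests for the presence of the tag, as samtools view -d GN does in the log earlier in the thread, would place all three toy reads except r2 in the "genes" group. That would be consistent with the nearly empty Tester_nogenes_overlappingtes.bam (2.1K) in the directory listing, although the thread does not confirm this explicitly.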


JBreunig commented 1 year ago

Yes, I'm also using STARsolo (but not Cumulus). I use a custom reference based on mm10 that includes a handful of transgenes that we add. Happy to share the BAM or other items if it helps.

bvaldebenitom commented 1 year ago

Thanks for the quick reply.

And yes, I would appreciate it a lot if you could share a BAM file containing chromosome 1 only, which should hopefully be enough for validation (we are now working on the fix for this issue).

JBreunig commented 1 year ago

Just tried to run and got a new error:

python /mnt/Sabrent2TBRefsCR/SoloTE/SoloTE_v1/SoloTE_pipeline.py /mnt/12TBNew0821/AnatDecabitine/star_out/PBS2/Aligned.sortedByCoord.out.bam 48 Tester /mnt/Sabrent2TBRefsCR/SoloTE/Mm10TEannotationV2.bed
SoloTE started at 10:15:11
samtools found!
bedtools found!
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '3', '4', '5', '6', '7', '8', '9', 'MT', 'X', 'Y', 'JH584299.1', 'GL456233.1', 'JH584301.1', 'GL456211.1', 'GL456350.1', 'JH584293.1', 'GL456221.1', 'JH584297.1', 'JH584296.1', 'GL456354.1', 'JH584294.1', 'JH584298.1', 'JH584300.1', 'GL456219.1', 'GL456210.1', 'JH584303.1', 'JH584302.1', 'GL456212.1', 'JH584304.1', 'GL456379.1', 'GL456216.1', 'GL456393.1', 'GL456366.1', 'GL456367.1', 'GL456239.1', 'GL456213.1', 'GL456383.1', 'GL456385.1', 'GL456360.1', 'GL456378.1', 'GL456389.1', 'GL456372.1', 'GL456370.1', 'GL456381.1', 'GL456387.1', 'GL456390.1', 'GL456394.1', 'GL456392.1', 'GL456382.1', 'GL456359.1', 'GL456396.1', 'GL456368.1', 'JH584292.1', 'JH584295.1', 'postWPREV5gestalt', 'BFP2', 'mTom', 'mGFP', 'YAPRELApA', 'DNETV5Celltag']
['@CO\tuser command line: /mnt/Sabrent2TBRefsCR/STARlatest0822/STAR ', 'quantMode GeneCounts ', 'soloType CB_UMI_Simple ', 'soloCBwhitelist /mnt/Sabrent2TBRefsCR/STAR/3M-february-2018.txt ', 'soloCBlen 16 ', 'soloUMIstart 17 ', 'soloUMIlen 12 ', 'soloBarcodeReadLength 1 ', 'soloMultiMappers EM ', 'soloFeatures Gene Velocyto ', 'soloUMIfiltering MultiGeneUMI ', 'soloCBmatchWLtype 1MM_multi_pseudocounts ', 'outSAMtype BAM SortedByCoordinate ', 'outSAMattributes NH HI CR UR CB UB GX GN ', 'outSAMmultNmax 1 ', 'runThreadN 32 ', 'genomeDir /mnt/Sabrent2TBRefsCR/WorkingMouseRefKBtools01012020/WPRE_V5Gestalt_static_DoxMinTm_0621 ', 'sjdbGTFfile /mnt/Sabrent2TBRefsCR/WorkingMouseRefKBtools01012020/WPRE_V5Gestalt_static_DoxMinTm_0621/tmp.gtf ', 'readFilesCommand zcat ', 'readFilesPrefix /mnt/12TBNew0821/AnatDecabitine/SS-15340', '01', '14', '2022/FASTQ/ ', 'readFilesIn PBS-CTRL-GEX_S2_L001_R2_001.fastq.gz,PBS-CTRL-GEX_S2_L002_R2_001.fastq.gz PBS-CTRL-GEX_S2_L001_R1_001.fastq.gz,PBS-CTRL-GEX_S2_L002_R1_001.fastq.gz ', 'outTmpDir=/mnt/Sabrent4TBMsTum/AY_7319_Dox_Suc_Etv5_07_19/STARtmp ', 'outFileNamePrefix star_out/PBS2/']
1
outSAMattributes NH HI CR UR CB UB GX GN
CB and UB tags present in BAM file
Traceback (most recent call last):
  File "/mnt/Sabrent2TBRefsCR/SoloTE/SoloTE_v1/SoloTE_pipeline.py", line 81, in <module>
    outprefix = sys.argv[5]
IndexError: list index out of range

Edit: looking at the code, it seems there is a new command-line argument for outprefix, so I added one and it is running now.

bvaldebenitom commented 1 year ago

@JBreunig you are correct. We detected a bug when running it, so we modified the command line arguments.

Please let us know whether you are able to run the pipeline successfully, so that we can make an official release of this updated version.
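
For context, the IndexError above comes from reading sys.argv[5] when only four arguments were given. A guard along the following lines would report a usage message instead. This is a sketch under assumptions, not SoloTE's code: `parse_positional` is a hypothetical helper, and apart from `outprefix` (which appears in the traceback) the argument names are placeholders, since the updated argument order is not shown in this thread.

```python
def parse_positional(argv, names):
    """Map positional command-line arguments to names, or exit with a usage message."""
    given = argv[1:]
    if len(given) != len(names):
        usage = "usage: {} {}".format(argv[0], " ".join("<{}>".format(n) for n in names))
        raise SystemExit(
            "{}\nexpected {} arguments, got {}".format(usage, len(names), len(given))
        )
    return dict(zip(names, given))


# Placeholder names; only 'outprefix' is taken from the traceback in this thread.
NAMES = ["arg1", "arg2", "arg3", "arg4", "outprefix"]

# Demonstration with a made-up argument vector of the right length.
args = parse_positional(["pipeline.py", "a", "b", "c", "d", "results"], NAMES)
print(args["outprefix"])
```

With one argument missing, the call raises SystemExit carrying the usage line, which is easier to act on than a bare "list index out of range".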

JBreunig commented 1 year ago

Everything looks good through to processing in Seurat...thanks!

bvaldebenitom commented 1 year ago

You're more than welcome!