bvaldebenitom / SoloTE

GNU General Public License v3.0
23 stars 6 forks source link

Error running soloTE : "ERROR: Requested column 3, but database file - only has fields 1 - 0" #14

Closed yeroslaviz closed 1 year ago

yeroslaviz commented 1 year ago

I am having a similar problem as a few other users hee, but can't solve it.

I'm working with the following tools within a conda environment

I'm getting a similar error:

$ python SoloTE/SoloTE_1.08/SoloTE_pipeline.py -t 20 -d SoloTE/output/ -o cellranger -a SoloTE/dm6.soloTE_ChromNames_corrected.bed  -b Ctrl_FC/outs/possorted_genome_bam.bam
SoloTE started at 14:45:05
samtools found!
bedtools found!
BAM file generated using CellRanger
SoloTE v1.08 started!
SoloTE Home directory /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/SoloTE_1.08
SoloTE executed from /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons
Results will be stored in /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/output
Input BAM file: /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/Ctrl_FC/outs/possorted_genome_bam.bam
Input TE BED file: /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/dm6.soloTE_ChromNames_corrected.bed
Currently working in temporary directory: /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/output/cellranger_SoloTE_temp
samtools view --threads 20 -d GN -U cellranger_nogenes_oldtag.bam -O BAM -o temp.bam /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/Ctrl_FC/outs/possorted_genome_bam.bam
samtools view -d GN:- -o cellranger_nogenes_newtag.bam -U cellranger_genes.bam -O BAM temp.bam
samtools cat --threads 20 -o cellranger_nogenes.bam cellranger_nogenes_oldtag.bam cellranger_nogenes_newtag.bam
samtools view --threads 20 -O BAM -o cellranger_nogenes_overlappingtes.bam -L /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/dm6.soloTE_ChromNames_corrected.bed cellranger_nogenes.bam
samtools index cellranger_nogenes_overlappingtes.bam
bedtools bamtobed -i cellranger_nogenes_overlappingtes.bam -split > cellranger_nogenes_overlappingtes.bed
bedtools intersect -a /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/dm6.soloTE_ChromNames_corrected.bed -b cellranger_nogenes_overlappingtes.bed -u > cellranger_selectedtes.bed
python /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/SoloTE_1.08/annotateBAM.py cellranger_nogenes_overlappingtes.bam cellranger_selectedtes.bed cellranger_teannotated.bam
samtools cat --threads 20 -o cellranger_full.bam cellranger_genes.bam cellranger_teannotated.bam
samtools sort --threads 20 -O BAM -o cellranger_full_sorted.bam cellranger_full.bam
[bam_sort_core] merging from 12 files and 20 in-memory blocks...
samtools view -U cellranger_final.bam --tag CB:- -@ 8 cellranger_full_sorted.bam > /dev/null
samtools index cellranger_final.bam
Process: 761619 2L
Process: 761620 2R
Process: 761621 3L
Process: 761622 3R
Process: 761623 4
Process: 761624 mitochondrion_genome
Process: 761625 X
Process: 761626 Y
samtools view --keep-tag GN,CB,UB cellranger_final.bam 2L -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-"'|awk 'BEGIN{FS=OFS="\t"}{for (tag_index=NF-2;tag_index<=NF;tag_index++){ tag_index_s=gensub(/([A-Z]{2}):Z:(.*)/,"\\1\t\\2","g",$tag_index); split(tag_index_s,t,"\t");tag_dict[t[1]]=t[2]}; print tag_dict["GN"],tag_dict["CB"],tag_dict["UB"]}' > cellranger_countpercell_2L.counts.tmp
samtools view --keep-tag GN,CB,UB cellranger_final.bam 3R -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-"'|awk 'BEGIN{FS=OFS="\t"}{for (tag_index=NF-2;tag_index<=NF;tag_index++){ tag_index_s=gensub(/([A-Z]{2}):Z:(.*)/,"\\1\t\\2","g",$tag_index); split(tag_index_s,t,"\t");tag_dict[t[1]]=t[2]}; print tag_dict["GN"],tag_dict["CB"],tag_dict["UB"]}' > cellranger_countpercell_3R.counts.tmp
samtools view --keep-tag GN,CB,UB cellranger_final.bam 2R -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-"'|awk 'BEGIN{FS=OFS="\t"}{for (tag_index=NF-2;tag_index<=NF;tag_index++){ tag_index_s=gensub(/([A-Z]{2}):Z:(.*)/,"\\1\t\\2","g",$tag_index); split(tag_index_s,t,"\t");tag_dict[t[1]]=t[2]}; print tag_dict["GN"],tag_dict["CB"],tag_dict["UB"]}' > cellranger_countpercell_2R.counts.tmp
samtools view --keep-tag GN,CB,UB cellranger_final.bam 3L -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-"'|awk 'BEGIN{FS=OFS="\t"}{for (tag_index=NF-2;tag_index<=NF;tag_index++){ tag_index_s=gensub(/([A-Z]{2}):Z:(.*)/,"\\1\t\\2","g",$tag_index); split(tag_index_s,t,"\t");tag_dict[t[1]]=t[2]}; print tag_dict["GN"],tag_dict["CB"],tag_dict["UB"]}' > cellranger_countpercell_3L.counts.tmp
samtools view --keep-tag GN,CB,UB cellranger_final.bam 4 -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-"'|awk 'BEGIN{FS=OFS="\t"}{for (tag_index=NF-2;tag_index<=NF;tag_index++){ tag_index_s=gensub(/([A-Z]{2}):Z:(.*)/,"\\1\t\\2","g",$tag_index); split(tag_index_s,t,"\t");tag_dict[t[1]]=t[2]}; print tag_dict["GN"],tag_dict["CB"],tag_dict["UB"]}' > cellranger_countpercell_4.counts.tmp
samtools view --keep-tag GN,CB,UB cellranger_final.bam X -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-"'|awk 'BEGIN{FS=OFS="\t"}{for (tag_index=NF-2;tag_index<=NF;tag_index++){ tag_index_s=gensub(/([A-Z]{2}):Z:(.*)/,"\\1\t\\2","g",$tag_index); split(tag_index_s,t,"\t");tag_dict[t[1]]=t[2]}; print tag_dict["GN"],tag_dict["CB"],tag_dict["UB"]}' > cellranger_countpercell_X.counts.tmp
samtools view --keep-tag GN,CB,UB cellranger_final.bam mitochondrion_genome -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-"'|awk 'BEGIN{FS=OFS="\t"}{for (tag_index=NF-2;tag_index<=NF;tag_index++){ tag_index_s=gensub(/([A-Z]{2}):Z:(.*)/,"\\1\t\\2","g",$tag_index); split(tag_index_s,t,"\t");tag_dict[t[1]]=t[2]}; print tag_dict["GN"],tag_dict["CB"],tag_dict["UB"]}' > cellranger_countpercell_mitochondrion_genome.counts.tmp
samtools view --keep-tag GN,CB,UB cellranger_final.bam Y -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-"'|awk 'BEGIN{FS=OFS="\t"}{for (tag_index=NF-2;tag_index<=NF;tag_index++){ tag_index_s=gensub(/([A-Z]{2}):Z:(.*)/,"\\1\t\\2","g",$tag_index); split(tag_index_s,t,"\t");tag_dict[t[1]]=t[2]}; print tag_dict["GN"],tag_dict["CB"],tag_dict["UB"]}' > cellranger_countpercell_Y.counts.tmp
CompletedProcess(args=['wc', '-l', 'cellranger_countpercell_Y.counts'], returncode=0, stdout='46126 cellranger_countpercell_Y.counts\n')
CompletedProcess(args=['wc', '-l', 'cellranger_countpercell_4.counts'], returncode=0, stdout='654118 cellranger_countpercell_4.counts\n')
CompletedProcess(args=['wc', '-l', 'cellranger_countpercell_mitochondrion_genome.counts'], returncode=0, stdout='2290533 cellranger_countpercell_mitochondrion_genome.counts\n')
CompletedProcess(args=['wc', '-l', 'cellranger_countpercell_3L.counts'], returncode=0, stdout='11443591 cellranger_countpercell_3L.counts\n')
CompletedProcess(args=['wc', '-l', 'cellranger_countpercell_X.counts'], returncode=0, stdout='11696933 cellranger_countpercell_X.counts\n')
CompletedProcess(args=['wc', '-l', 'cellranger_countpercell_3R.counts'], returncode=0, stdout='13650943 cellranger_countpercell_3R.counts\n')
CompletedProcess(args=['wc', '-l', 'cellranger_countpercell_2R.counts'], returncode=0, stdout='12907214 cellranger_countpercell_2R.counts\n')
CompletedProcess(args=['wc', '-l', 'cellranger_countpercell_2L.counts'], returncode=0, stdout='11521534 cellranger_countpercell_2L.counts\n')
['cellranger_countpercell_Y.counts', 'cellranger_countpercell_4.counts', 'cellranger_countpercell_mitochondrion_genome.counts', 'cellranger_countpercell_3L.counts', 'cellranger_countpercell_X.counts', 'cellranger_countpercell_3R.counts', 'cellranger_countpercell_2R.counts', 'cellranger_countpercell_2L.counts']

*****
***** ERROR: Requested column 3, but database file - only has fields 1 - 0.
Traceback (most recent call last):
  File "/fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/SoloTE_1.08/SoloTE_pipeline.py", line 321, in <module>
    marketmatrix_line3 = genenumber.stdout.split(" ")[0]+" "+barcodenumber.stdout.split(" ")[0]+" "+allcounts_number.stdout.split(" ")[0]
AttributeError: 'str' object has no attribute 'stdout'

I have made sure, that chromosome names in both files are identical. For the command and input files you can have a look at the attached Inputfiles.txt file

any ideas, How I can solve the problem.

thanks

Assa

P.S.

Not sure, if this is relevant. The bed file I have created after downloading the file from UCSC. It also contains all the "Un_" and the "_random" entries, which I don't have in my bam file. The bam files were created using either cellranger or STARsolo with a genome downloaded from Ensembl with only the main chromosomes

$ grep ">" reference/dm.BDGP6.32.fa
>2L dna:primary_assembly primary_assembly:BDGP6.32:2L:1:23513712:1 REF
>2R dna:primary_assembly primary_assembly:BDGP6.32:2R:1:25286936:1 REF
>3L dna:primary_assembly primary_assembly:BDGP6.32:3L:1:28110227:1 REF
>3R dna:primary_assembly primary_assembly:BDGP6.32:3R:1:32079331:1 REF
>4 dna:primary_assembly primary_assembly:BDGP6.32:4:1:1348131:1 REF
>mitochondrion_genome dna:primary_assembly primary_assembly:BDGP6.32:mitochondrion_genome:1:19524:1 REF
>X dna:primary_assembly primary_assembly:BDGP6.32:X:1:23542271:1 REF
>Y dna:primary_assembly primary_assembly:BDGP6.32:Y:1:3667352:1 REF

Do I somehow need to filter my bed file to contain only the entries with these chromosomes?

Inputfiles.txt

bvaldebenitom commented 1 year ago

Hi @yeroslaviz!

Can you share the output of uname -a and the results of the directory created by SoloTE using ls -Rlht SoloTE/output/?

The input BED and the input BAM files seem to be ok based on the lines you sent. Depending on the files that have generated in the temporary directory, we can further assess what's going on!

Best, Braulio.

yeroslaviz commented 1 year ago

Hi, and yhanks for the fats response.

here are the details:

$ uname -a
Linux supercube016 5.15.0-69-generic #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

and the second one (I'm running soloTE against two different bam files, but putting the output of only one here, as the same error occurs in both)

$ ls -Rlht SoloTE/output/
SoloTE/output/:
total 2.0K
drwxr-xr-x 2 yeroslaviz b_cox 4.0K May  2 16:27 starsolo_SoloTE_temp
drwxr-xr-x 2 yeroslaviz b_cox 4.0K May  2 16:21 cellranger_SoloTE_temp

SoloTE/output/cellranger_SoloTE_temp:
total 115G
-rw-r--r-- 1 yeroslaviz b_cox  14M May  2 16:23 cellranger_barcodes.tsv
-rw-r--r-- 1 yeroslaviz b_cox 1.7M May  2 16:21 cellranger_genes.tsv
-rw-r--r-- 1 yeroslaviz b_cox 1.7G May  2 16:21 cellranger_allcounts_final.txt
-rw-r--r-- 1 yeroslaviz b_cox  20M May  2 16:21 cellranger_subftes_2.txt
-rw-r--r-- 1 yeroslaviz b_cox    0 May  2 16:21 cellranger_locustes_2.txt
-rw-r--r-- 1 yeroslaviz b_cox 1.7G May  2 16:21 cellranger_genes_2.txt
-rw-r--r-- 1 yeroslaviz b_cox  21M May  2 16:19 cellranger_subftes.txt
-rw-r--r-- 1 yeroslaviz b_cox    0 May  2 16:19 cellranger_locustes.txt
-rw-r--r-- 1 yeroslaviz b_cox 1.7G May  2 16:19 cellranger_genes.txt
-rw-r--r-- 1 yeroslaviz b_cox 1.7G May  2 16:19 cellranger_allcounts.txt
-rw-r--r-- 1 yeroslaviz b_cox 304M May  2 16:19 cellranger_countpercell_2L.counts
-rw-r--r-- 1 yeroslaviz b_cox 344M May  2 16:18 cellranger_countpercell_2R.counts
-rw-r--r-- 1 yeroslaviz b_cox 2.7G May  2 16:17 cellranger_countpercell_2L.counts.tmp
-rw-r--r-- 1 yeroslaviz b_cox 363M May  2 16:17 cellranger_countpercell_3R.counts
-rw-r--r-- 1 yeroslaviz b_cox 320M May  2 16:17 cellranger_countpercell_X.counts
-rw-r--r-- 1 yeroslaviz b_cox 2.6G May  2 16:16 cellranger_countpercell_2R.counts.tmp
-rw-r--r-- 1 yeroslaviz b_cox 303M May  2 16:16 cellranger_countpercell_3L.counts
-rw-r--r-- 1 yeroslaviz b_cox 2.3G May  2 16:15 cellranger_countpercell_3R.counts.tmp
-rw-r--r-- 1 yeroslaviz b_cox 2.4G May  2 16:15 cellranger_countpercell_X.counts.tmp
-rw-r--r-- 1 yeroslaviz b_cox 2.2G May  2 16:14 cellranger_countpercell_3L.counts.tmp
-rw-r--r-- 1 yeroslaviz b_cox  66M May  2 16:10 cellranger_countpercell_mitochondrion_genome.counts
-rw-r--r-- 1 yeroslaviz b_cox 1.4G May  2 16:09 cellranger_countpercell_mitochondrion_genome.counts.tmp
-rw-r--r-- 1 yeroslaviz b_cox  18M May  2 16:03 cellranger_countpercell_4.counts
-rw-r--r-- 1 yeroslaviz b_cox 128M May  2 16:03 cellranger_countpercell_4.counts.tmp
-rw-r--r-- 1 yeroslaviz b_cox 2.5M May  2 16:02 cellranger_countpercell_Y.counts
-rw-r--r-- 1 yeroslaviz b_cox 4.3M May  2 16:02 cellranger_countpercell_Y.counts.tmp
-rw-r--r-- 1 yeroslaviz b_cox 639K May  2 16:02 cellranger_final.bam.bai
-rw-r--r-- 1 yeroslaviz b_cox  18G May  2 15:59 cellranger_final.bam
-rw-r--r-- 1 yeroslaviz b_cox  18G May  2 15:32 cellranger_full_sorted.bam
-rw-r--r-- 1 yeroslaviz b_cox  18G May  2 15:22 cellranger_full.bam
-rw-r--r-- 1 yeroslaviz b_cox  29M May  2 15:21 cellranger_teannotated.bam
-rw-r--r-- 1 yeroslaviz b_cox 1.2M May  2 15:20 cellranger_selectedtes.bed
-rw-r--r-- 1 yeroslaviz b_cox  40M May  2 15:20 cellranger_nogenes_overlappingtes.bed
-rw-r--r-- 1 yeroslaviz b_cox 106K May  2 15:20 cellranger_nogenes_overlappingtes.bam.bai
-rw-r--r-- 1 yeroslaviz b_cox  37M May  2 15:20 cellranger_nogenes_overlappingtes.bam
-rw-r--r-- 1 yeroslaviz b_cox 2.4G May  2 15:20 cellranger_nogenes.bam
-rw-r--r-- 1 yeroslaviz b_cox  18G May  2 15:20 cellranger_genes.bam
-rw-r--r-- 1 yeroslaviz b_cox  858 May  2 15:20 cellranger_nogenes_newtag.bam
-rw-r--r-- 1 yeroslaviz b_cox 2.4G May  2 14:51 cellranger_nogenes_oldtag.bam
-rw-r--r-- 1 yeroslaviz b_cox  18G May  2 14:51 temp.bam

thanks

bvaldebenitom commented 1 year ago

Hi @yeroslaviz,

It looks like there is a small bug almost at the end of the pipeline, as most intermediate files have been generated without issues.

Can you run the attached script inside the cellranger_SoloTE_temp directory with this command? python generatecounts_debug_20230105.py

After running it, please paste here the output.

generatecounts_debug_20230105.tar.gz

fw262 commented 1 year ago

Hello, I have the same issue and this is the output for me after running python generatecounts_debug_20230105.py 154693 124910 1694304

fw262 commented 1 year ago

I managed to get it to work with SoloTE_1.06/SoloTE_pipeline.py instead of SoloTE_1.08

bvaldebenitom commented 1 year ago

Hi @fw262 , thanks for the feedback!!

Please put the attached developmental version in the same folder as you have the SoloTE_pipeline.py script, and run it with the same command (i.e., python SoloTE_developmental_20230503.py (...)). It should continue from the last steps and not cause the error.

SoloTE_developmental_20230503.tar.gz

bvaldebenitom commented 1 year ago

Glad that it worked @fw262 !

yeroslaviz commented 1 year ago

python generatecounts_debug_20230105.py

I have run the generatecounts_debug_20230105.py script and got the following output:

32936 736828 64204077

But running the developmental script resulted in the same error as before:

*****
***** ERROR: Requested column 3, but database file - only has fields 1 - 0.
Creating final results directory
/fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/output/chr2L_SoloTE_output was created
Traceback (most recent call last):
  File "/fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/SoloTE_1.08/SoloTE_developmental_20230503.py", line 342, in <module>
    file_table = pd.read_table(input_te_file,header=None,sep="\t")
  File "/fs/home/yeroslaviz/miniconda3/envs/soloTE/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/fs/home/yeroslaviz/miniconda3/envs/soloTE/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/fs/home/yeroslaviz/miniconda3/envs/soloTE/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1289, in read_table
    return _read(filepath_or_buffer, kwds)
  File "/fs/home/yeroslaviz/miniconda3/envs/soloTE/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/fs/home/yeroslaviz/miniconda3/envs/soloTE/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/fs/home/yeroslaviz/miniconda3/envs/soloTE/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
    return mapping[engine](f, **self.options)
  File "/fs/home/yeroslaviz/miniconda3/envs/soloTE/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 79, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 554, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

(I have re-run this test on a smaller subset containing only chromosome 2L)

yeroslaviz commented 1 year ago

It seems that running the developmental script does create though an output folder with the three needed files.

I still get the error message. I can't say for sure, but this might has something to do with the fact that the features.tsv file in the output folder contains now only two columns instead of three.

Can this be the problem? I copy-paste the complete output from STDOUT here below, as you can follow it better.

For now I don't know if the results are good enough or if I can use the output files in the downstream analysis. Is it just a formatting problem, or if there are other problems in the analysis.

thanks for the help

Assa

$ python SoloTE/SoloTE_1.08/SoloTE_developmental_20230503.py -t 20 -d SoloTE/output/ -o cellranger -a SoloTE/dm6.soloTE_ChromNames_corrected.bed  -b Ctrl_FC/outs/possorted_genome_bam.bam
SoloTE started at 08:31:19
samtools found!
bedtools found!
BAM file generated using CellRanger
SoloTE v1.08 started!
SoloTE Home directory /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/SoloTE_1.08
SoloTE executed from /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons
Results will be stored in /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/output
Input BAM file: /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/Ctrl_FC/outs/possorted_genome_bam.bam
Input TE BED file: /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/dm6.soloTE_ChromNames_corrected.bed
Currently working in temporary directory: /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/output/cellranger_SoloTE_temp
samtools view --threads 20 -d GN -U cellranger_nogenes_oldtag.bam -O BAM -o temp.bam /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/Ctrl_FC/outs/possorted_genome_bam.bam
samtools view -d GN:- -o cellranger_nogenes_newtag.bam -U cellranger_genes.bam -O BAM temp.bam
samtools cat --threads 20 -o cellranger_nogenes.bam cellranger_nogenes_oldtag.bam cellranger_nogenes_newtag.bam
samtools view --threads 20 -O BAM -o cellranger_nogenes_overlappingtes.bam -L /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/dm6.soloTE_ChromNames_corrected.bed cellranger_nogenes.bam
samtools index cellranger_nogenes_overlappingtes.bam
bedtools bamtobed -i cellranger_nogenes_overlappingtes.bam -split > cellranger_nogenes_overlappingtes.bed
bedtools intersect -a /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/dm6.soloTE_ChromNames_corrected.bed -b cellranger_nogenes_overlappingtes.bed -u > cellranger_selectedtes.bed
python /fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/SoloTE_1.08/annotateBAM.py cellranger_nogenes_overlappingtes.bam cellranger_selectedtes.bed cellranger_teannotated.bam
samtools cat --threads 20 -o cellranger_full.bam cellranger_genes.bam cellranger_teannotated.bam
samtools sort --threads 20 -O BAM -o cellranger_full_sorted.bam cellranger_full.bam
[bam_sort_core] merging from 12 files and 20 in-memory blocks...
samtools view -U cellranger_final.bam --tag CB:- -@ 8 cellranger_full_sorted.bam > /dev/null
samtools index cellranger_final.bam
Process: 808376 2L
Process: 808377 2R
Process: 808379 3L
Process: 808378 3R
Process: 808380 4
Process: 808381 mitochondrion_genome
Process: 808382 X
Process: 808383 Y
samtools view --keep-tag GN,CB,UB cellranger_final.bam 3R -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-"'|awk 'BEGIN{FS=OFS="\t"}{for (tag_index=NF-2;tag_index<=NF;tag_index++){ tag_index_s=gensub(/([A-Z]{2}):Z:(.*)/,"\\1\t\\2","g",$tag_index); split(tag_index_s,t,"\t");tag_dict[t[1]]=t[2]}; print tag_dict["GN"],tag_dict["CB"],tag_dict["UB"]}' > cellranger_countpercell_3R.counts.tmp
samtools view --keep-tag GN,CB,UB cellranger_final.bam 2L -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-"'|awk 'BEGIN{FS=OFS="\t"}{for (tag_index=NF-2;tag_index<=NF;tag_index++){ tag_index_s=gensub(/([A-Z]{2}):Z:(.*)/,"\\1\t\\2","g",$tag_index); split(tag_index_s,t,"\t");tag_dict[t[1]]=t[2]}; print tag_dict["GN"],tag_dict["CB"],tag_dict["UB"]}' > cellranger_countpercell_2L.counts.tmp
samtools view --keep-tag GN,CB,UB cellranger_final.bam 2R -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-"'|awk 'BEGIN{FS=OFS="\t"}{for (tag_index=NF-2;tag_index<=NF;tag_index++){ tag_index_s=gensub(/([A-Z]{2}):Z:(.*)/,"\\1\t\\2","g",$tag_index); split(tag_index_s,t,"\t");tag_dict[t[1]]=t[2]}; print tag_dict["GN"],tag_dict["CB"],tag_dict["UB"]}' > cellranger_countpercell_2R.counts.tmp
samtools view --keep-tag GN,CB,UB cellranger_final.bam 3L -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-"'|awk 'BEGIN{FS=OFS="\t"}{for (tag_index=NF-2;tag_index<=NF;tag_index++){ tag_index_s=gensub(/([A-Z]{2}):Z:(.*)/,"\\1\t\\2","g",$tag_index); split(tag_index_s,t,"\t");tag_dict[t[1]]=t[2]}; print tag_dict["GN"],tag_dict["CB"],tag_dict["UB"]}' > cellranger_countpercell_3L.counts.tmp
samtools view --keep-tag GN,CB,UB cellranger_final.bam 4 -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-"'|awk 'BEGIN{FS=OFS="\t"}{for (tag_index=NF-2;tag_index<=NF;tag_index++){ tag_index_s=gensub(/([A-Z]{2}):Z:(.*)/,"\\1\t\\2","g",$tag_index); split(tag_index_s,t,"\t");tag_dict[t[1]]=t[2]}; print tag_dict["GN"],tag_dict["CB"],tag_dict["UB"]}' > cellranger_countpercell_4.counts.tmp
samtools view --keep-tag GN,CB,UB cellranger_final.bam Y -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-"'|awk 'BEGIN{FS=OFS="\t"}{for (tag_index=NF-2;tag_index<=NF;tag_index++){ tag_index_s=gensub(/([A-Z]{2}):Z:(.*)/,"\\1\t\\2","g",$tag_index); split(tag_index_s,t,"\t");tag_dict[t[1]]=t[2]}; print tag_dict["GN"],tag_dict["CB"],tag_dict["UB"]}' > cellranger_countpercell_Y.counts.tmp
samtools view --keep-tag GN,CB,UB cellranger_final.bam mitochondrion_genome -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-"'|awk 'BEGIN{FS=OFS="\t"}{for (tag_index=NF-2;tag_index<=NF;tag_index++){ tag_index_s=gensub(/([A-Z]{2}):Z:(.*)/,"\\1\t\\2","g",$tag_index); split(tag_index_s,t,"\t");tag_dict[t[1]]=t[2]}; print tag_dict["GN"],tag_dict["CB"],tag_dict["UB"]}' > cellranger_countpercell_mitochondrion_genome.counts.tmp
samtools view --keep-tag GN,CB,UB cellranger_final.bam X -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-"'|awk 'BEGIN{FS=OFS="\t"}{for (tag_index=NF-2;tag_index<=NF;tag_index++){ tag_index_s=gensub(/([A-Z]{2}):Z:(.*)/,"\\1\t\\2","g",$tag_index); split(tag_index_s,t,"\t");tag_dict[t[1]]=t[2]}; print tag_dict["GN"],tag_dict["CB"],tag_dict["UB"]}' > cellranger_countpercell_X.counts.tmp
CompletedProcess(args=['wc', '-l', 'cellranger_countpercell_Y.counts'], returncode=0, stdout='46126 cellranger_countpercell_Y.counts\n')
CompletedProcess(args=['wc', '-l', 'cellranger_countpercell_4.counts'], returncode=0, stdout='654118 cellranger_countpercell_4.counts\n')
CompletedProcess(args=['wc', '-l', 'cellranger_countpercell_mitochondrion_genome.counts'], returncode=0, stdout='2290533 cellranger_countpercell_mitochondrion_genome.counts\n')
CompletedProcess(args=['wc', '-l', 'cellranger_countpercell_3L.counts'], returncode=0, stdout='11443591 cellranger_countpercell_3L.counts\n')
CompletedProcess(args=['wc', '-l', 'cellranger_countpercell_X.counts'], returncode=0, stdout='11696933 cellranger_countpercell_X.counts\n')
CompletedProcess(args=['wc', '-l', 'cellranger_countpercell_3R.counts'], returncode=0, stdout='13650943 cellranger_countpercell_3R.counts\n')
CompletedProcess(args=['wc', '-l', 'cellranger_countpercell_2R.counts'], returncode=0, stdout='12907214 cellranger_countpercell_2R.counts\n')
CompletedProcess(args=['wc', '-l', 'cellranger_countpercell_2L.counts'], returncode=0, stdout='11521534 cellranger_countpercell_2L.counts\n')
['cellranger_countpercell_Y.counts', 'cellranger_countpercell_4.counts', 'cellranger_countpercell_mitochondrion_genome.counts', 'cellranger_countpercell_3L.counts', 'cellranger_countpercell_X.counts', 'cellranger_countpercell_3R.counts', 'cellranger_countpercell_2R.counts', 'cellranger_countpercell_2L.counts']

*****
***** ERROR: Requested column 3, but database file - only has fields 1 - 0.
Creating final results directory
/fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/output/cellranger_SoloTE_output was created
Traceback (most recent call last):
  File "/fs/gpfs41/lv02/fileset01/pool/pool-cox-projects-bioinformatics/AG_Plitzko/SvenK_Retrotransposons/SoloTE/SoloTE_1.08/SoloTE_developmental_20230503.py", line 342, in <module>
    file_table = pd.read_table(input_te_file,header=None,sep="\t")
  File "/fs/home/yeroslaviz/miniconda3/envs/soloTE/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/fs/home/yeroslaviz/miniconda3/envs/soloTE/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/fs/home/yeroslaviz/miniconda3/envs/soloTE/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1289, in read_table
    return _read(filepath_or_buffer, kwds)
  File "/fs/home/yeroslaviz/miniconda3/envs/soloTE/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/fs/home/yeroslaviz/miniconda3/envs/soloTE/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/fs/home/yeroslaviz/miniconda3/envs/soloTE/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
    return mapping[engine](f, **self.options)
  File "/fs/home/yeroslaviz/miniconda3/envs/soloTE/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 79, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 554, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
bvaldebenitom commented 1 year ago

@yeroslaviz can you confirm the number of lines in the following files with wc -l?

cellranger_subftes_2.txt
cellranger_locustes_2.txt
cellranger_genes_2.txt

From the previous answer, it may seem that one of these files are empty.

Additionally, please add here the output of samtools view -c -q 255 cellranger_teannotated.bam

yeroslaviz commented 1 year ago

Hi, yes, you're right the cellranger_locustes_2.txt file is empty.

What should I expect to see in this file?

also the other results are:

$ wc -l SoloTE/output/cellranger_SoloTE_temp/cellranger_subftes_2.txt
348828 SoloTE/output/cellranger_SoloTE_temp/cellranger_subftes_2.txt
$ wc -l SoloTE/output/cellranger_SoloTE_temp/cellranger_locustes_2.txt 
0 SoloTE/output/cellranger_SoloTE_temp/cellranger_locustes_2.txt
$ wc -l SoloTE/output/cellranger_SoloTE_temp/cellranger_genes_2.txt 
63855249 SoloTE/output/cellranger_SoloTE_temp/cellranger_genes_2.txt

$ samtools view -c -q 255 SoloTE/output/cellranger_SoloTE_temp/cellranger_teannotated.bam
268090

Does it has anything to do with the fact that I'm working on Drosophila?

bvaldebenitom commented 1 year ago

Basically, it is indicating that no locus-specific TEs are being found. This is odd, because the reads with MAPQ = 255 should be annotated as locus-specific, and there are 268090 in the TE BAM file. Would it be possible for you to send me the cellranger_teannotated.bam to inspect it?

yeroslaviz commented 1 year ago

Hi, yes of course.

you can find the fiule under this link - https://datashare.biochem.mpg.de/s/a8QtGebXmcBRJ0d

it is approx. 30Mb in size, so not that big.

thanks for the help.

bvaldebenitom commented 1 year ago

Happy to help, and thanks for sharing the BAM file!

There is a bug for locus-specific TEs. The "chr" word is searched when summarizing these TEs. We will work on updating the pipeline so this doesn't cause errors. In the meantime, please try updating your TE BED file using this command: awk 'BEGIN{FS=OFS="\t"}{$4="chr"$4; print $0}' dm6.soloTE_ChromNames_corrected.bed > dm6_SoloTE_ChrIdsFixed.bed

It should result in something like this:

2L  9726    9861    chr2L|9726|9861|IDEFIX_LTR:Gypsy:LTR|+  39  +
2L  9889    9993    chr2L|9889|9993|DNAREP1_DM:Helitron:RC|-    71  -
2L  13255   13311   chr2L|13255|13311|DNAREP1_DM:Helitron:RC|-  16  -
2L  15866   15955   chr2L|15866|15955|DNAREP1_DM:Helitron:RC|-  44  -
2L  24235   24484   chr2L|24235|24484|DNAREP1_DM:Helitron:RC|+  183 +
2L  27528   27835   chr2L|27528|27835|DNAREP1_DM:Helitron:RC|-  70  -
2L  47515   52544   chr2L|47515|52544|LINEJ1_DM:Jockey:LINE|+   6385    +
2L  60251   60631   chr2L|60251|60631|DNAREP1_DM:Helitron:RC|+  295 +
2L  60652   60786   chr2L|60652|60786|DNAREP1_DM:Helitron:RC|+  75  +
2L  64280   64914   chr2L|64280|64914|BS2:Jockey:LINE|+ 679 +

Then, re-run the pipeline in a new directory.

Let me know how this goes!

yeroslaviz commented 1 year ago

Hi,

thanks a lot for all the help. The run was now finished successfully. and the files seems to be complete.

Do you maybe have also another workflow with more details on how to continue now downstream with the Seurat or singleCellExperiment packages? something similar to how you showed in your paper?

thanks for everything

Assa

bvaldebenitom commented 1 year ago

That's great!

As for a downstream analysis, we have a brief protocol described here. Overall, if you are using Single Cell data that has been already annotated, with recommend to use that as metadata to check the clustering results, and do a marker analysis.

yeroslaviz commented 1 year ago

thanks for that.

Just a comment about the features file. When trying to convert it into a seurat object it throws some warnings.

Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
Warning: Feature names cannot have pipe characters ('|'), replacing with dashes ('-')

I saw that you use a lot of the pipe symbol in the soloTE IDs. maybe they can be replaced before hand?

bvaldebenitom commented 1 year ago

@yeroslaviz those warnings shouldn't interfere in any of the Seurat analysis.

In theory, generating a BED file with "-" instead of "|" should not interfere with the proper execution of the pipeline, but we wanted to avoid using "-" as some TEs contain that character in their name ("HERVK13-int").