bvaldebenitom / SoloTE

GNU General Public License v3.0
23 stars 6 forks source link

SoloTE 1.09 Issue ValueError: Length mismatch: Expected axis has 4 elements, new values have 6 elements #30

Closed BrunoGuillotin closed 2 weeks ago

BrunoGuillotin commented 9 months ago

Dear SoloTE Community,

I used the new version of SoloTE 1.09 and got the following error. Using Python 3.10.0 on a University cluster so Ubuntu I guess

Do you know what could be the issue ?

Here is the run message

Loading module 'gencore/1' [bam_sort_core] merging from 0 files and 4 in-memory blocks... SoloTE started at 21:22:34 [OK] samtools found! [OK] bedtools found! SoloTE v1.09 started! SoloTE Home directory /scratch/gencore/software/SoloTE-1.09 SoloTE executed from /scratch/bg93/SoloTE Results will be stored in /scratch/bg93/SoloTE Input BAM file: /scratch/bg93/Sorghum_Nuclei_IR13_ER8_Ctr_Aux_Fastq_Align_andTE_bis/35_ER8_Control_cellrangerCount/outs/possorted_genome_bam.bam Input TE BED file: /scratch/bg93/SoloTE/Panda_AT-TEs_annotation_v1.0_allTEs_SoloTE2.bed Currently working in temporary directory: /scratch/bg93/SoloTE/SoloTE_SoloTE_temp samtools view -@ 4 -O BAM -o SoloTE_nogenes_overlappingtes.bam -L /scratch/bg93/SoloTE/Panda_AT-TEs_annotation_v1.0_allTEs_SoloTE2.bed -e '(exists([CB]) && exists([UB]) && [CB]!="-" && [UB]!="-") && (!exists([GN]) || [GN]=="-")' /scratch/bg93/Sorghum_Nuclei_IR13_ER8_Ctr_Aux_Fastq_Align_andTE_bis/35_ER8_Control_cellrangerCount/outs/possorted_genome_bam.bam samtools index SoloTE_nogenes_overlappingtes.bam bedtools bamtobed -i SoloTE_nogenes_overlappingtes.bam -split > SoloTE_nogenes_overlappingtes.bed bedtools intersect -a /scratch/bg93/SoloTE/Panda_AT-TEs_annotation_v1.0_allTEs_SoloTE2.bed -b SoloTE_nogenes_overlappingtes.bed -u > SoloTE_selectedtes.bed python /scratch/gencore/software/SoloTE-1.09/annotateBAM.py SoloTE_nogenes_overlappingtes.bam SoloTE_selectedtes.bed temp_annotated_te.bam 1 samtools sort -@ 4 -O BAM -o SoloTE_teannotated.bam temp_annotated_te.bam samtools merge --threads 4 -o - /scratch/bg93/Sorghum_Nuclei_IR13_ER8_Ctr_Aux_Fastq_Align_andTE_bis/35_ER8_Control_cellrangerCount/outs/possorted_genome_bam.bam SoloTE_teannotated.bam|samtools view -@ 4 -O BAM -o SoloTE_final.bam -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-" && [GN]!="-"' --keep-tag GN,CB,UB samtools index SoloTE_final.bam Counts for chromosome 2 are being generated in process: 3588534 Counts for chromosome Mt are being generated in process: 3588534 Counts for chromosome Pt are being generated in process: 3588534 Counts for chromosome 1 are being generated in process: 3588533 Counts for chromosome 3 are being generated in process: 3588536 Counts for chromosome 4 are being generated in process: 3588535 Counts for chromosome 5 are being generated in process: 3588535 Traceback (most recent call last): File "/scratch/gencore/software/SoloTE-1.09/SoloTE_pipeline.py", line 221, in te_final_df.columns=['TE_original_id','barcode','counts','Subfamily','Family','Class'] File "/scratch/gencore/conda3/envs/soloTE/lib/python3.10/site-packages/pandas/core/generic.py", line 6216, in setattr return object.setattr(self, name, value) File "properties.pyx", line 69, in pandas._libs.properties.AxisProperty.set File "/scratch/gencore/conda3/envs/soloTE/lib/python3.10/site-packages/pandas/core/generic.py", line 767, in _set_axis self._mgr.set_axis(axis, labels) File "/scratch/gencore/conda3/envs/soloTE/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 227, in set_axis self._validate_set_axis(axis, new_labels) File "/scratch/gencore/conda3/envs/soloTE/lib/python3.10/site-packages/pandas/core/internals/base.py", line 85, in _validate_set_axis raise ValueError( ValueError: Length mismatch: Expected axis has 4 elements, new values have 6 elements

Here is the head of the bam file I used.

A00534:58:HLWVMDRXX:2:2227:5321:14215 272 1 513 0 8M312704N86M 0 0 CTTGTCATAGATACCCTTCAACTTAATATATAAAAAGCTGATTAAGTCGTAAAACAACGATTTCTTGGTAGATATGAGTCTATGTAGTCATGAT :FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:3 HI:i:3 AS:i:89 nM:i:0 RE:A:I li:i:0 BC:Z:GCCTTGGT QT:Z:FFFFFFFF CR:Z:GAGCTGCAGAACGTGC CY:Z:FFFFFFFFFF:FFFFF CB:Z:GAGCTGCAGAACGTGC-1 UR:Z:ACTGTTCCAGGA UY:Z:FFFFFFFFFFFF UB:Z:ACTGTTCCAGGA RG:Z:35_cellrangerCount:0:1:HLWVMDRXX:2 A00534:58:HLWVMDRXX:2:2102:25617:24752 16 1 941 255 66M28S 0 0 ATATACCCGTTATTCCCACAATCATATGCTTTCTAAAAGCAAAAGTATATGTCAACAATTGGTTATCCCATGTACTCTGCGTTGATACCACTGC FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,F NH:i:1 HI:i:1 AS:i:64 nM:i:0 RE:A:I xf:i:0 li:i:0 BC:Z:AAACCTCA QT:Z::F:FFFFF CR:Z:CACTGAAGTACAAGCG CY:Z:FFFFFFFFFFFFFFFF CB:Z:CACTGAAGTACAAGCG-1 UR:Z:GATGCTCAAGCC UY:Z:FFFFFFFFFFFF UB:Z:GATGCTCAAGCC RG:Z:35_cellrangerCount:0:1:HLWVMDRXX:2 A00534:58:HLWVMDRXX:1:2201:12707:32925 16 1 1073 255 7S20M171168N67M * 0 0 CGGCCGCCCCCCACCCGCCCCCCCCACAACGATGTGTGGTCTTATATATAGGTTTTGTCGATCAGAAGAAATCTTTAAAAGTAGGAAGTTTGAT :,,:F,,,F,,F,FF,:,F,FF,,F,,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:69 nM:i:3 RE:A:I xf:i:0 li:i:0 BC:Z:CTGGACTC QT:Z:FFFFFFFF CR:Z:GATTGGTGTGTTAAAG CY:Z:FFFFFFFFFFFFFFFF CB:Z:GATTGGTGTGTTAAAG-1 UR:Z:AATCCGTGTGGG UY:Z:F:FFFFFFFFFF UB:Z:AATCCGTGTGGG RG:Z:35_cellrangerCount:0:1:HLWVMDRXX:1

Thank you very much in advance for your help,

Bruno

bvaldebenitom commented 9 months ago

Hi @BrunoGuillotin ,

could you share the output of the following commands? head /scratch/bg93/SoloTE/Panda_AT-TEs_annotation_v1.0_allTEs_SoloTE2.bed ls -lht /scratch/bg93/SoloTE/SoloTE_SoloTE_temp

BrunoGuillotin commented 9 months ago

Hi, Thanks for the super fast reply

Sure head /scratch/bg93/SoloTE/Panda_AT-TEs_annotation_v1.0_allTEs_SoloTE2.bed 1 11896 11976 1|11896|11976|ATCOPIA24|Copia|LTR|+ 1 16882 17009 1|16882|17009|ATREP4|Helitron|RC|- 1 17023 18924 1|17023|18924|ATREP3|Helitron|RC|+ 1 18330 18642 1|18330|18642|ATHATN7|HAT|DNA|- 1 55675 56576 1|55675|56576|SIMPLEHAT1|HAT|DNA|+ 1 76843 77500 1|76843|77500|TA11|L1|LINE|+ 1 78287 78785 1|78287|78785|HELITRONY1D|Helitron|RC|- 1 154330 154418 1|154330|154418|ATLINE1_3A|L1|LINE|- 1 193673 194263 1|193673|194263|ATREP5|Helitron|RC|- 1 194263 194300 1|194263|194300|ATREP3|Helitron|RC|-

ls -lht /scratch/bg93/SoloTE/SoloTE_SoloTE_temp -rw-r--r-- 1 bg93 users 1.1G Sep 14 22:33 SoloTE_allcounts.txt -rw-r--r-- 1 bg93 users 243M Sep 14 22:33 SoloTE_countpercell_5.counts -rw-r--r-- 1 bg93 users 261M Sep 14 22:27 SoloTE_countpercell_1.counts -rw-r--r-- 1 bg93 users 212M Sep 14 22:25 SoloTE_countpercell_3.counts -rw-r--r-- 1 bg93 users 2.1M Sep 14 22:23 SoloTE_countpercell_Pt.counts -rw-r--r-- 1 bg93 users 5.9M Sep 14 22:23 SoloTE_countpercell_Mt.counts -rw-r--r-- 1 bg93 users 166M Sep 14 22:23 SoloTE_countpercell_2.counts -rw-r--r-- 1 bg93 users 155M Sep 14 22:22 SoloTE_countpercell_4.counts -rw-r--r-- 1 bg93 users 727K Sep 14 22:15 SoloTE_final.bam.bai -rw-r--r-- 1 bg93 users 5.6G Sep 14 22:14 SoloTE_final.bam -rw-r--r-- 1 bg93 users 39M Sep 14 21:43 SoloTE_teannotated.bam -rw-r--r-- 1 bg93 users 41M Sep 14 21:43 temp_annotated_te.bam -rw-r--r-- 1 bg93 users 854K Sep 14 21:40 SoloTE_selectedtes.bed -rw-r--r-- 1 bg93 users 177M Sep 14 21:40 SoloTE_nogenes_overlappingtes.bed -rw-r--r-- 1 bg93 users 210K Sep 14 21:40 SoloTE_nogenes_overlappingtes.bam.bai -rw-r--r-- 1 bg93 users 87M Sep 14 21:39 SoloTE_nogenes_overlappingtes.bam

Bru

bvaldebenitom commented 9 months ago

You are welcome!

Could you change the formatting of the BED file from 1 11896 11976 1|11896|11976|ATCOPIA24|Copia|LTR|+ to 1 11896 11976 chr1|11896|11976|ATCOPIA24:Copia:LTR|+ ?

Please note that TE information is separated with : and there is an added "chr" to the beginning of column 4. SoloTE depends on both aspects incorporated in column 4 for creating the count matrices. Right now the error seems to be related to the : character missing. The "chr" is required to identify locus-specific TEs.

In awk, the following command should do it: awk 'BEGIN{FS=OFS="\t"}{split($4,a,"|");print $1,$2,$3,"chr"a[1]"|"a[2]"|"a[3]"|"a[4]":"a[5]":"a[6]"|"a[7]}' /scratch/bg93/SoloTE/Panda_AT-TEs_annotation_v1.0_allTEs_SoloTE2.bed > /scratch/bg93/SoloTE/Panda_AT-TEs_annotation_v1.0_allTEs_SoloTE2_fixedIDs.bed

Once this is done, delete the current output directory and re-run the pipeline with the updated BED file.

BrunoGuillotin commented 9 months ago

Hi @bvaldebenitom,

Thanks for the reply, it works like a charm. Thanks for this super tool I am excited to analyse the results.

I would advice to maybe add the specific point of the "chr" at the beginning of column 4 in your Readme file. For example, "Make sure to add "chr" in front on the chromosome number at beginning of column 4. The first column should contain the same chromosome nomenclature as the genome of reference" In my genome chromosome are just numbers. I also thought that the term "sequenceName" was a bit confusing, maybe "chromosome/contig name" would be better ?

I just have a few questions:

1-Regarding the output could you precise the difference between SoloTE_legacytes_MATRIX SoloTE_locustes_MATRIX The other matrices are self explanatory to me I think.

2- Also I got an error when I tried to run the pipeline with the argument --dual Does the default setting include dual (ie: counting TE in genes) ?

3- Could you explain the argument [--minoverlap MINOVERLAP] ?

4- In the output, would it be possible to get the geneID and not the gene names ? Or one column gene ID the other gene name. This is a big issue in plants different plant species as our genomes are quite bad and multiple genes have the same name ( I know it's stupid ^^).

Sorry for all these questions and thanks again in advance for your help,

Bruno

bvaldebenitom commented 9 months ago

@BrunoGuillotin :

thanks for the feedback and kind words. We will add your suggestions for the next release!

Please see below for a point-by-point response:

1-Regarding the output could you precise the difference between SoloTE_legacytes_MATRIX SoloTE_locustes_MATRIX The other matrices are self explanatory to me I think.

Legacy TEs correspond to the original behavior of the pipeline, in which reads mapping uniquely to TEs contribute locus-specific counts, and will appear with the full locus identifier in the matrix (i.e. SoloTE|chr|Start|End|TEfamily). On the other hand, reads mapping to multiple locations contribute to subfamily counts, and those are summarized without locus information, and also included in the matrix (i.e., SoloTE|TEfamily).

The new Locus TEs only include uniquely mapped reads, and thus, only locus results in the matrix, without the subfamily counts described above. Please see the following figure for an overview on this: figure_solote_legacyvslocus

2- Also I got an error when I tried to run the pipeline with the argument --dual Does the default setting include dual (ie: counting TE in genes) ?

Can you share the error you get?

Dual is not the default setting. However, using SoloTE, you might still detect reads from intronic TEs. This depends on how you generated the BAM file though. For example, latest versions of CellRanger annotate intronic reads to the gene by default, and if it is annotated to a gene, SoloTE will not analyze them.

3- Could you explain the argument [--minoverlap MINOVERLAP] ?

This parameter defines the minimum number of base pairs (bp) to count a read towards a TE. The default value is 1, meaning that at least 1 bp should overlap between the read and the TE to count it. In preliminary tests, I noticed that some TEs have only a small overlap with reads, and thus, increasing the minoverlap value helped to diminish spurious results.

4- In the output, would it be possible to get the geneID and not the gene names ? Or one column gene ID the other gene name. This is a big issue in plants different plant species as our genomes are quite bad and multiple genes have the same name ( I know it's stupid ^^).

This would depend on whether the gene ID is stored in the BAM file. In most cases this should be available. Would it be possible for you to share the BAM file you are using? If it helps, you can generate a subset of the first chromosome with SAMtools: samtools view /scratch/bg93/Sorghum_Nuclei_IR13_ER8_Ctr_Aux_Fastq_Align_andTE_bis/35_ER8_Control_cellrangerCount/outs/possorted_genome_bam.bam 1

With this I can run some tests by adding a feature that will allow for this behavior.

Sorry for all these questions and thanks again in advance for your help,

Bruno

No worries. Hope the answers are helpful.

Best, Braulio.

BrunoGuillotin commented 9 months ago

Hi Braulio,

Amazing explanations thanks.

2- Here is the error i get, when I add --dual to the command. The error shows up in 3-4 seconds after the beginning of the run Command

python /scratch/gencore/software/SoloTE-1.09/SoloTE_pipeline.py --threads 4 \
 --bam /scratch/bg93/Sorghum_Nuclei_IR13_ER8_Ctr_Aux_Fastq_Align_andTE_bis/35_ER8_Control_cellrangerCount/outs/possorted_genome_bam.bam \
 --teannotation /scratch/bg93/SoloTE/Panda_AT-TEs_annotation_v1.0_allTEs_SoloTE2_fixedIDs.bed \
  --outputprefix SoloTE --outputdir /scratch/bg93/SoloTE/test3/ --dual

Output Error

sh: -c: line 0: unexpected EOF while looking for matching '' sh: -c: line 1: syntax error: unexpected end of file [E::hts_open_format] Failed to open file "SoloTE_nogenes_overlappingtes.bam" : No such file or directory samtools index: failed to open "SoloTE_nogenes_overlappingtes.bam": No such file or directory [E::hts_open_format_impl] Failed to open file SoloTE_nogenes_overlappingtes.bam Failed to open BAM file SoloTE_nogenes_overlappingtes.bam [E::hts_open_format] Failed to open file "SoloTE_nogenes_overlappingtes.bam" : No such file or directory Traceback (most recent call last): File "/scratch/gencore/software/SoloTE-1.09/annotateBAM.py", line 5, in samfile = pysam.AlignmentFile(inputfile,"rb") File "pysam/libcalignmentfile.pyx", line 747, in pysam.libcalignmentfile.AlignmentFile.cinit File "pysam/libcalignmentfile.pyx", line 946, in pysam.libcalignmentfile.AlignmentFile._open FileNotFoundError: [Errno 2] could not open alignment file SoloTE_nogenes_overlappingtes.bam: No such file or directory [E::hts_open_format] Failed to open file "temp_annotated_te.bam" : No such file or directory samtools sort: can't open "temp_annotated_te.bam": No such file or directory [E::hts_open_format] Failed to open file "SoloTE_teannotated.bam" : No such file or directory samtools merge: fail to open "SoloTE_teannotated.bam": No such file or directory [main_samview] fail to read the header from "-". [E::hts_open_format] Failed to open file "SoloTE_final.bam" : No such file or directory samtools index: failed to open "SoloTE_final.bam": No such file or directory [E::hts_open_format] Failed to open file "SoloTE_final.bam" : No such file or directory SoloTE started at 06:22:43 [OK] samtools found! [OK] bedtools found! SoloTE v1.09 started! SoloTE Home directory /scratch/gencore/software/SoloTE-1.09 SoloTE executed from /scratch/bg93/SoloTE Results will be stored in /scratch/bg93/SoloTE/test3 Input BAM file: /scratch/bg93/Sorghum_Nuclei_IR13_ER8_Ctr_Aux_Fastq_Align_andTE_bis/35_ER8_Control_cellrangerCount/outs/possorted_genome_bam.bam Input TE BED file: /scratch/bg93/SoloTE/Panda_AT-TEs_annotation_v1.0_allTEs_SoloTE2_fixedIDs.bed Dual mode enabled. SoloTE will calculate TE expression also considering reads annotated to genes. Currently working in temporary directory: /scratch/bg93/SoloTE/test3/SoloTE_SoloTE_temp samtools view -@ 4 -O BAM -o SoloTE_nogenes_overlappingtes.bam -L /scratch/bg93/SoloTE/Panda_AT-TEs_annotation_v1.0_allTEs_SoloTE2_fixedIDs.bed -e '(exists([CB]) && exists([UB]) && [CB]!="-" && [UB]!="-") /scratch/bg93/Sorghum_Nuclei_IR13_ER8_Ctr_Aux_Fastq_Align_andTE_bis/35_ER8_Control_cellrangerCount/outs/possorted_genome_bam.bam samtools index SoloTE_nogenes_overlappingtes.bam bedtools bamtobed -i SoloTE_nogenes_overlappingtes.bam -split > SoloTE_nogenes_overlappingtes.bed bedtools intersect -a /scratch/bg93/SoloTE/Panda_AT-TEs_annotation_v1.0_allTEs_SoloTE2_fixedIDs.bed -b SoloTE_nogenes_overlappingtes.bed -u > SoloTE_selectedtes.bed python /scratch/gencore/software/SoloTE-1.09/annotateBAM.py SoloTE_nogenes_overlappingtes.bam SoloTE_selectedtes.bed temp_annotated_te.bam 1 samtools sort -@ 4 -O BAM -o SoloTE_teannotated.bam temp_annotated_te.bam samtools merge --threads 4 -o - /scratch/bg93/Sorghum_Nuclei_IR13_ER8_Ctr_Aux_Fastq_Align_andTE_bis/35_ER8_Control_cellrangerCount/outs/possorted_genome_bam.bam SoloTE_teannotated.bam|samtools view -@ 4 -O BAM -o SoloTE_final.bam -e 'exists([CB]) && exists([UB]) && exists([GN]) && [CB]!="-" && [UB]!="-" && [GN]!="-"' --keep-tag GN,CB,UB samtools index SoloTE_final.bam Traceback (most recent call last): File "/scratch/gencore/software/SoloTE-1.09/SoloTE_pipeline.py", line 187, in for chromosome_info in pysam.idxstats(final_bam).split("\n"): File "/scratch/gencore/conda3/envs/soloTE/lib/python3.10/site-packages/pysam/utils.py", line 83, in call raise SamtoolsError( pysam.utils.SamtoolsError: 'samtools returned with error 1: stdout=, stderr=samtools idxstats: failed to open "SoloTE_final.bam": No such file or directory\n'

4- Here are two lines from my bam file from CellRanger 7.0.0, the first read map to the gene ID AT1G01050.1 and gene name PPA1. The second read to gene ID AT5G01010 (multiple splice variants, ie: .1 .3 .4 etc), gene name AT5G01010 (not every gene have names in the genome)

A00534:58:HLWVMDRXX:1:2157:24071:34507  16  1   31260   255 94M *   0   0   ACCAAATGATAAGGAAGAAAAATGCAGAGATAAAAGTAATATCAATTAGGATCATATGCTTCTTATTATCAATGAAAAGTAACAGAAACATAGA  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF  NH:i:1  HI:i:1  AS:i:92 nM:i:0  TX:Z:AT1G01050.1,+810,94M   GX:Z:AT1G01050  GN:Z:PPA1   fx:Z:AT1G01050  RE:A:E  xf:i:17 li:i:0  BC:Z:GCCTTGGT   QT:Z:FFFFFFFF   CR:Z:AGGAGTTCACGGACCC   CY:Z:FFFFFFFFFFFFFFFF   UR:Z:GCCTACTTATAT   UY:Z:FFFFFFFFFFFF   UB:Z:GCCTACTTATAT   RG:Z:35_cellrangerCount:0:1:HLWVMDRXX:1
A00534:58:HLWVMDRXX:1:2168:11659:5494   1040    5   1342    255 94M *   0   0   AATGATGAGTGTTCATGACTGTAAAATCCACCTTCATCTTCCACTTTCAGTTTAACGGCTCCGGCTCTGGATCCGGCTCCACTACTGGTGCTAT  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F  NH:i:1  HI:i:1  AS:i:92 nM:i:0  TX:Z:AT5G01010.1,+1501,94M;AT5G01010.3,+1500,94M;AT5G01010.4,+1508,94M;AT5G01010.5,+1462,94M    GX:Z:AT5G01010  GN:Z:AT5G01010  fx:Z:AT5G01010  RE:A:E  li:i:0  BC:Z:GCCTTGGT   QT:Z:FFFFFFFF   CR:Z:GATGGAGCAGGCTTGC   CY:Z:FFFFFFFFFFFFFFFF   CB:Z:GATGGAGCAGGCTTGC-1 UR:Z:TCTTCCCCAAGC   UY:Z:FFFFFFFFFFFF   UB:Z:TCTTCCCCAAGC   xf:i:17 RG:Z:35_cellrangerCount:0:1:HLWVMDRXX:1

As I said it would be nice to get both gene name and gene ID in the output, but no rush.

Thanks in advance, And have a good weekend,

Bru

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 10 days with no activity.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been inactive for 14 days since being marked as stale.