W-L / deviaTE

Python tool for the analysis and visualization of mobile genetic elements
GNU General Public License v3.0

IndexError: list index out of range #4

Closed vmerel closed 5 years ago

vmerel commented 5 years ago

Hi,

I got the following error when trying to use deviaTE_analyse:

Starting analysis of: Sequence in: Sample.fastq.fused.sort.bam
no annotaions found for: Sequence
Traceback (most recent call last):
  File "/var/lib/miniconda3/envs/deviaTE_env/bin/deviaTE_analyse", line 59, in <module>
    sample.perform_pileup(hq_threshold=args.hq_threshold)
  File "/var/lib/miniconda3/envs/deviaTE_env/lib/python3.6/site-packages/deviaTE/deviaTE_pileup.py", line 84, in perform_pileup
    pr.count_nucleotide(sample_sites=self.sites)
  File "/var/lib/miniconda3/envs/deviaTE_env/lib/python3.6/site-packages/deviaTE/deviaTE_pileup.py", line 356, in count_nucleotide
    site = sample_sites[self.column_pos]
IndexError: list index out of range

Here is the command line: deviaTE_analyse --input $bam --family $Fam --library $Library --single_copy_genes 1,2,3

I am using Ubuntu, samtools 1.7 (using htslib 1.7-2), and bwa version 0.7.17-r1188.

Can you help me?

Vincent

W-L commented 5 years ago

Hello Vincent, from the command line you used I assume you want to analyze a .bam file that contains reads that were already mapped to consensus sequences of TEs?

--input $bam

If that is the case, you have to specify your input file with --input_bam $bam instead. Please let me know if this solves your issue. Otherwise it would be great if you could share what your bash variables evaluate to. Best wishes

vmerel commented 5 years ago

Thank you for your answer.

If I try with --input_bam:

usage: deviaTE_analyse [-h] --input INPUT --family FAMILY [--library LIBRARY]
                       [--output OUTPUT] [--sample_id SAMPLE_ID]
                       [--annotation ANNOTATION] [--no_freq_corr]
                       [--hq_threshold HQ_THRESHOLD]
                       [--rpm | --single_copy_genes SINGLE_COPY_GENES]
deviaTE_analyse: error: the following arguments are required: --input

W-L commented 5 years ago

Ah, my mistake. I overlooked that you are using deviaTE_analyse instead of the wrapper script deviaTE. In that case it is hard to tell what is going on. A guess would be that a read was aligned to coordinates that are larger than the length of the reference sequence, which should not happen. Or maybe the detection of internal deletions has messed up and declared a deletion outside of the reference range. Would you mind sharing your --input and --library file or a sample thereof that recreates the error, as well as your argument to --family? Then I can investigate further.
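To illustrate the first guess: if per-site counters are stored in a list with one entry per reference position, then any alignment column at or beyond the reference length produces exactly this kind of IndexError. The following is a minimal sketch of that failure mode and of an explicit bounds check. It is not deviaTE's actual code; `make_sites` and `count_base` are hypothetical names chosen for illustration.

```python
def make_sites(ref_length):
    """Create one counter dict per reference position (hypothetical layout)."""
    return [{"A": 0, "C": 0, "G": 0, "T": 0} for _ in range(ref_length)]


def count_base(sites, column_pos, base):
    """Count a base at a 0-based column; report out-of-range columns clearly."""
    if not 0 <= column_pos < len(sites):
        raise ValueError(
            f"column {column_pos} lies outside the reference (length {len(sites)})"
        )
    sites[column_pos][base] += 1


sites = make_sites(ref_length=5)
count_base(sites, 4, "A")  # last valid position of a 5 bp reference

try:
    sites[5]  # unchecked access one past the end, as in the traceback
except IndexError as err:
    unchecked_error = str(err)
```

With a check like this, a read or deletion that extends past the reference would be reported with its offending coordinate instead of a bare IndexError.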

vmerel commented 5 years ago

Here you can find a fastq, a bam, and a subset of the library (if it helps: the last sequence produced an output, but the others did not): https://filesender.renater.fr/?s=download&token=c58ff9ac-e0a7-73f2-7337-62b489eb4b73

W-L commented 5 years ago

Thank you for providing the files! There seem to be two issues here: one technical and one related to your library file.

  1. Technical problem: Did you install deviaTE with the conda environment? It seems that some functionality in conda has changed and it installs an older version of the tool. I am trying to figure out why that is at the moment. Could you run

     conda list | grep 'deviate'

     from within the environment and report the version number that comes up? If it is anything other than 0.3.7, then you will have to set up the conda environment again with an exact specification of the version number. Like this:

     conda create deviaTE==0.3.7 -c r -c defaults -c conda-forge -c bioconda -c w-l -n deviaTE_env

     Sorry about that!

  2. The first fasta sequence in your library contains the symbol / in the header, which causes a problem when creating the output files, since / is the separator for filepaths. After replacing / with another symbol, e.g. _, the reads will have to be remapped. I tested it on your data by running:

     deviaTE --library D_Tak.short_replaced.fa --families ALL --input_fq D_Tak_R1.fastq

     (the keyword ALL with the argument --families automatically runs the analysis for all families in the library file). I will put an automatic replacement of / into the next version of the tool. Alternatively, you could replace the symbol in the library as well as in the name of the reference sequence within your mapped bam file (e.g. with sed). That way the reads would not have to be remapped.
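For the library file, the replacement of / in the headers could be sketched roughly as follows. This is only an illustration, not part of deviaTE: the function name `sanitize_fasta_headers`, the example header `DM/roo`, and the file names in the comments are all hypothetical.

```python
def sanitize_fasta_headers(lines, bad="/", replacement="_"):
    """Yield FASTA lines with `bad` replaced in header lines only."""
    for line in lines:
        if line.startswith(">"):
            # Header line: swap the problematic symbol
            yield line.replace(bad, replacement)
        else:
            # Sequence line: pass through unchanged
            yield line


# Small in-memory example with a made-up header containing "/"
example = [">DM/roo\n", "ACGT\n", ">copia\n", "TTGA\n"]
cleaned = list(sanitize_fasta_headers(example))

# Applied to files (hypothetical names), this could look like:
# with open("library.fa") as src, open("library_replaced.fa", "w") as dst:
#     dst.writelines(sanitize_fasta_headers(src))
```

Because the reference names change, reads mapped against the old library would need to be remapped, or the bam header adjusted to match, as described above.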

I hope this helps, please let me know

vmerel commented 5 years ago

Thank you for your answer.

  1. Yes, I installed deviaTE with the conda environment.

     conda list | grep 'deviate'
     deviate 0.2.1.1 py36_2 w-l

OK, so I started again using v0.3.7 and replacing "/" in my library, and everything seems to work fine! Thank you!

I just have a quick question. For some sequences I got this:

Reference sequence contains ambiguous nucleotide: W
Reference sequence contains ambiguous nucleotide: K
Reference sequence contains ambiguous nucleotide: W
Reference sequence contains ambiguous nucleotide: M

Do you have any advice on how to deal with this, knowing that for the moment I am more interested in comparing abundance between samples than in sequence divergence? I thought about replacing these with "N", but I don't know if that is advised and/or necessary.

Vincent.

W-L commented 5 years ago

Great to hear that, thanks for the reply!

Concerning the warning about ambiguous nucleotides: This means that at certain positions in the reference/consensus sequence of the TE, the reference nucleotide is one of the letters that represent multiple, ambiguous nucleotides. All nucleotides that map to this position count towards coverage normally, like at any other position in the sequence. If you are only interested in the abundance, then this is not an issue at all and you do not have to replace the ambiguous nucleotides. Essentially it only means that this site cannot be identified as a reference SNP (a SNP where the reference nucleotide has been completely replaced by another one). But it will still be identified as a polymorphic SNP with the same conditions independent of the reference nucleotide (min. 10% of the total counts at that position and a minimum of 10% frequency).
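To see where such letters occur in a reference sequence, a small scan for IUPAC ambiguity codes can help. The sketch below is not how deviaTE detects them; `find_ambiguous` and the example sequence are made up for illustration. Only the code table itself follows the standard IUPAC nucleotide notation.

```python
# Standard IUPAC ambiguity codes and the bases each one can stand for
IUPAC_AMBIGUOUS = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_ambiguous(seq):
    """Return (0-based position, code, possible bases) for each ambiguous site."""
    return [
        (pos, base, IUPAC_AMBIGUOUS[base])
        for pos, base in enumerate(seq.upper())
        if base in IUPAC_AMBIGUOUS
    ]


# Made-up sequence with a W at position 3 and a K at position 6
hits = find_ambiguous("ACGWTTKA")
```

For the example sequence, `hits` lists the W and the K together with the bases they may represent, which matches the kind of letters named in the warnings above.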