Errors during the analysis execution

alanlorenzetti commented 7 years ago

First of all, I would like to say that I'm trying to use the tool to find new insertion sites in prokaryotic genomes, and I'm violating some of the requirements of this software:

i) I'm using paired-end RNASeq data and not DNASeq; ii) These libraries have no fixed insert length, since in this experiment we are using a range of insert sizes (max 600 nt).

Despite that, I have the software properly set up and working, and I got results with this approach. However, I got some errors during the execution and I would like to know whether they are signs that the analysis was compromised.

The first error occurs at the beginning of execution:

Traceback (most recent call last):
  File "/home/alorenzetti/bin/RelocaTE2/scripts/relocaTE_trim.py", line 462, in <module>
    main()
  File "/home/alorenzetti/bin/RelocaTE2/scripts/relocaTE_trim.py", line 287, in main
    coord = parse_align_blat(align_file, tandem_file, verbose)
  File "/home/alorenzetti/bin/RelocaTE2/scripts/relocaTE_trim.py", line 46, in parse_align_blat
    next(filehd)
StopIteration

The second one occurs at the end (Step 6):

Traceback (most recent call last):
  File "/home/alorenzetti/bin/RelocaTE2/scripts/clean_false_positive.py", line 134, in <module>
    main()
  File "/home/alorenzetti/bin/RelocaTE2/scripts/clean_false_positive.py", line 130, in main
    Overlap_TE_boundary(os.path.splitext(args.input)[0], args.refte, args.distance, args.bedtools)
  File "/home/alorenzetti/bin/RelocaTE2/scripts/clean_false_positive.py", line 79, in Overlap_TE_boundary
    idx, value = re.split(r'\=', attr)
ValueError: too many values to unpack

Can you help me with this issue?

Cheers

JinfengChen commented 7 years ago

Not sure what's happening with the first one. But the second one is due to an extra "=" in the last column of the GFF file, which could come from an ID or a repeat name.

For example, the last column of the GFF should have a structure like this: "ID=repeat_Chr3_2077949_2077951;Name=karma;". If you have a repeat name like "Name=karma=1;", you will get the error: too many values to unpack.
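The failure described above can be reproduced outside RelocaTE2. This is a minimal sketch, assuming the attribute column is split on ";" and each piece is then unpacked with re.split(r'\=', attr), as the traceback suggests. The helper parse_attributes is hypothetical and not part of RelocaTE2; it only illustrates a more tolerant way to split.

```python
import re

def parse_attributes(column):
    """Parse a GFF attribute column into a dict, tolerating '=' inside values.

    Hypothetical helper for illustration; not RelocaTE2 code.
    """
    attrs = {}
    for attr in column.strip().split(";"):
        if not attr:
            continue  # skip the empty piece left by a trailing ';'
        # partition splits on the first '=' only, so later '=' stay in the value
        key, _, value = attr.partition("=")
        attrs[key] = value
    return attrs

good = "ID=repeat_Chr3_2077949_2077951;Name=karma;"
bad = "ID=repeat_Chr3_2077949_2077951;Name=karma=1;"

# The two-name unpacking from the traceback fails once a piece holds two '='
try:
    idx, value = re.split(r"\=", "Name=karma=1")
except ValueError as err:
    print("ValueError:", err)

print(parse_attributes(good))
print(parse_attributes(bad))
```

Splitting on the first "=" only keeps a name such as "karma=1" intact instead of raising, which may or may not be the behaviour the tool's author wants.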

Please check whether you have an extra "=" in "OUTDIR/repeat/results/ALL.all_nonref_insert.raw.gff".
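The check suggested above can be scripted. This is a rough sketch, assuming the file is standard tab-separated GFF with the attributes in the ninth column; the function name find_suspect_attributes is made up for this example, and the file that clean_false_positive.py actually parses at that step may have a different layout.

```python
def find_suspect_attributes(lines):
    """Yield (line_number, attribute) for attributes that are not exactly key=value.

    `lines` is any iterable of text lines, for example an open file handle.
    Assumes nine tab-separated GFF columns; hypothetical helper.
    """
    for line_number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a full GFF record
        for attr in columns[8].split(";"):
            # a well-formed attribute contains exactly one '='
            if attr and attr.count("=") != 1:
                yield line_number, attr

example = [
    "Chr3\tRelocaTE2\tinsertion\t2077949\t2077951\t.\t+\t.\tID=repeat_Chr3_2077949_2077951;Name=karma;\n",
    "Chr3\tRelocaTE2\tinsertion\t3000000\t3000002\t.\t+\t.\tID=repeat_Chr3_3000000_3000002;Name=karma=1;\n",
]
for line_number, attr in find_suspect_attributes(example):
    print(f"line {line_number}: {attr}")
```

To run it on real output, pass an open file handle instead of the example list, e.g. the handle returned by open() on the GFF path named above.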

alanlorenzetti commented 7 years ago

Thank you very much. The first one doesn't occur in every sequencing library, so I don't think it is a real problem. I didn't find any "=" character in repeat names, but there are other characters (e.g. ":" and "_"). I'm going to test the removal of these characters and report back.
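On the first traceback: StopIteration is what Python raises when next() is called on an iterator that has nothing left. One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that parse_align_blat uses next(filehd) to skip header lines and the alignment file for some libraries is empty or shorter than expected. A small sketch of that behaviour, with a hypothetical helper skip_header:

```python
import io

def skip_header(filehd, n_lines):
    """Skip up to n_lines from a file-like object; return how many were skipped.

    Hypothetical helper for illustration; not RelocaTE2 code.
    """
    skipped = 0
    for _ in range(n_lines):
        # next() with a default returns None instead of raising StopIteration
        if next(filehd, None) is None:
            break
        skipped += 1
    return skipped

# A bare next() on an empty file raises StopIteration, as in the traceback
try:
    next(io.StringIO(""))
except StopIteration:
    print("StopIteration raised on an empty file")

print(skip_header(io.StringIO(""), 5))
print(skip_header(io.StringIO("a\nb\nc\n"), 2))
```

If that reading is right, the libraries that trigger the error would be the ones whose alignment file at that step is empty, which is easy to confirm by checking the file size.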

alanlorenzetti commented 7 years ago

I removed all special characters from TE names and now the error is gone.

However, there is still an error in execution:

Traceback (most recent call last):
  File "/home/alorenzetti/bin/RelocaTE2/scripts/clean_false_positive.py", line 134, in <module>
    main()
  File "/home/alorenzetti/bin/RelocaTE2/scripts/clean_false_positive.py", line 130, in main
    Overlap_TE_boundary(os.path.splitext(args.input)[0], args.refte, args.distance, args.bedtools)
  File "/home/alorenzetti/bin/RelocaTE2/scripts/clean_false_positive.py", line 79, in Overlap_TE_boundary
    idx, value = re.split(r'\=', attr)
ValueError: too many values to unpack

Is this error related to getting rid of insertions matching known insertions provided by the user? Can you help me to solve it?

Also, I would like to ask another question:

Why is there no TE name on insertion records that were found only with supporting reads? The name is reported as "repeat_name" in these cases. It also reports a range that is the input insert size plus twenty percent of it (size + size * 0.2). Can I say that an unknown insertion occurs in the region within this range?

JinfengChen / RelocaTE2
