clemgoub / dnaPipeTE

dnaPipeTE (for de-novo assembly & annotation Pipeline for Transposable Elements), is a pipeline designed to find, annotate and quantify Transposable Elements in small samples of NGS datasets. It is very useful to quantify the proportion of TEs in newly sequenced genomes since it does not require genome assembly and works on small datasets (< 1X).
49 stars 11 forks source link

Blast parsing errors at higher coverages #12

Closed estolle closed 6 years ago

estolle commented 6 years ago

Hi there

I've recently switched machines and thus have a fresh install of RM/blast and DnaPipeTE. I had a few successfull runs (except that it pretty much always says: join: contigsTrinityRM.sorted: No such file or directory)(cov 0.2/ 0.3X), but encounter some weird problem at the parsing of the blastoutput in runs with higher coverage (0.35/0.4X).

my command was: python3 ~/progz/dnaPipeTE/dnaPipeTE.py -input $FILE -output $F/DNAPIPETE_$NEWF -cpu $CPUs -genome_size 450000000 -genome_coverage $COV -samplenumber 3 mv landscape.pdf $F/DNAPIPETE$NEWF/ mv Rplots.pdf $F/DNAPIPETE_$NEWF/

Parsing blast3 output... rm: cannot remove '/scratch/scratchspace/QMUL_apocrita_temp_copy/007/mt_reduced/repeatcontent/DNAPIPETE_Srichteri_littleb_0.35/blast_out/int.reads_vs_annoted.blast.out': No such file or directory #######################################################

Estimation of Repeat content from blast outputs

####################################################### parsing blastout and adding RM annotations for each read... join: contigsTrinityRM.sorted: No such file or directory Done, results in: blast_out/blastout_final_fmtd_annoted #########################################

OK, lets build some pretty graphs

######################################### Drawing graphs... Error in read.table(paste(folder, file1, sep = "/")) : no lines available in input edit: this comes from the graph.R becasue the inputfile is non existent Execution halted

I cant really find the error. It seems parsing blastoutput 2 works but blastoutput3 is missing files particularly the step creating int.reads_vs_annoted.blast.out seems to fail since this file is missing sort -k1,1 -k12,12nr -k11,11n /scratch/repeatcontent/DNAPIPETE_Srichteri_bigB_0.40/blast_out/reads_vs_annoted.blast.out > /scratch/repeatcontent/DNAPIPETE_Srichteri_bigB_0.40/blast_out/int.reads_vs_annoted.blast.out

Any idea? Did others encounter the error? Funny thing is that if I run lower coverages from the same inputfile, then everything seems to work (still complaining about the join: contigsTrinityRM.sorted: No such file or directory tho).

just another hint for the next version: I think the landscapes plot is generated in the current DIR but should be better generated in the Output DIR. I had to softlink the blastparser.py in the current DIR, otherwise it would not find it.

clemgoub commented 6 years ago

Hello!

I'm sorry you're going through these annoying issues.

Looking at your command, it seems that you don't run dnaPipeTE from its install directory. I guess this is why you encounter the issue why the landscape and blastparser.py scripts. So far, to work properly, the best way to execute dnaPipeTE is to cd in the dnaPipeTE directory at the beginning of your job. Indeed, it'll be great to make all that work together in future version as you suggest, so it wont be necessary. Sorry for the inconvenience.

Regarding the join problem: it's a bug. Don't worry about it, I'll fix it ASAP. But this should not affect the result (especially the Count.txt and the graphs). These are leftovers of the old code. However a table called "blastout_final_fmtd_annoted" shouldn't be accurate then (but we never mentioned it in the manual because it was obsolete; I'm going either to fix it or remove it). Thanks for reporting it.

Now, for your problem with the blast on annotated repeat: one possibility is that RepeatMasker did not produce any output. Can you check if you have a Trinity.fasta.out and if it is not empty? Actually, this file "int.reads_vs_annoted.blast.out" is created during the blast2 step (reads vs annotated repeats) but for some reason the log displays it after the print of "blast3". Can you send me your complete log file? I will look in details also in int.

Thanks a lot and I hope we will be able to fix that quickly!

Best,

Clément

clemgoub commented 6 years ago

Hi Estole, is your issue still occurring?

estolle commented 6 years ago

Hi there

I moved the analysis to a different server with fresh installs of Repeatmasker, blast and DNApipeTE and I think its working ok now.

One thing I was wondering since a long time:

What are the different Trinity iteration do if the trinity.fasta is only the latest run?

And I seem to see some differences of the different blast outputs, i.e. between the numbers generated for the pie-chart and the the other quantification per element. Shouldn't they be identical?

clemgoub commented 6 years ago

Hi Estolle,

Thanks for the update! Glad you fixed it.

Regarding your questions

1) After each iteration of Trinity, dnaPipeTE keeps the reads mapped to the Inchworm contig (k-mer contigs) of Trinity and add a new "fresh" sample to start a new assembly. This enriches the reads for repeats one iteration after another. At the end, the last iteration produces a Trinity.fasta file with all reads used in the assembly in previous iterations + the one used from the new sample. In order to get an unbiased estimate of the repeat content, dnaPipeTE draw a new independent sample and maps it to the Trinity.fasa contigs (reads from the assembly are not used).

2) It is normal. The piechart is actually more than a simple quantification of the contigs (like in the table "reads_per_components_and_annotation") by blasting the reads over all contigs. In the piechart case (file Counts.txt), dnaPipeTE maps i) the reads against the annotated contigs + the RepeatMasker library; ii) the unmapped reads are then mapped to the "NA" contigs and finally iii) the read not mapped after these two blasts are considered as "non repetitive". The goal is to retrieve as much annotation as possible at the class level. Thus, the exact contig from which the read belong is not expected but one similar enough to infer a class for the repeat. By contrast, in "read_per_component_and_annotation" as well as for the landscapes, the exact contig from which the read belongs need to be retrieved, so dnaPipeTE map at a single time the reads against the whole "Trinity.fasta" file.

Hope it helps, but let me know if you have further questions!

Cheers,

Clément