RemiAllio / MitoFinder

MitoFinder: efficient automated large-scale extraction of mitogenomic data from high throughput sequencing data
86 stars 14 forks source link

Warning: None of paired reads aligned properly #25

Closed jbernst closed 2 years ago

jbernst commented 2 years ago

Hi,

I am working with formalin (degraded) samples in which I have 30-50 millions of reads from paired-end data. I know the degraded samples are a hassle to begin with and may not yield results, but I am trying to see if I can obtain any mitochondrial data from some sequence capture data. I had one of these work well and I obtained two mitochondrial genes, but another 7 samples I got the following errors:

======= SPAdes pipeline finished WITH WARNINGS!

=== Error correction and assembling warnings:
 * 0:51:23.455     8G / 9G    WARN    General                 (pair_info_count.cpp       : 341)   Unable to estimate insert size for paired library #0
 * 0:51:23.455     8G / 9G    WARN    General                 (pair_info_count.cpp       : 347)   None of paired reads aligned properly. Please, check orientation of your read pairs.
 * 0:51:23.458     8G / 9G    WARN    General                 (repeat_resolving.cpp      :  63)   Insert size was not estimated for any of the paired libraries, repeat resolution module will not run.
 * 1:25:43.121     8G / 8G    WARN    General                 (pair_info_count.cpp       : 175)   Single reads are not used in metagenomic mode
======= Warnings saved to /Volumes/BackupPlus/Mitolopsidae/Cryn_370/Cryn_370_metaspades/warnings.log

Is the orientation error possibly due to fragmented DNA? The first time I ran MitoFinder it worked fine, and I have have my R1.fastq.gz for the -1flag and the R2.fastq.gz file for the -2 flag. I'm trying to figure out if the issue with 'unable to estimate insert size' and the other warnings are due to this just being form a degraded sample. Do you have any suggestions?

Lastly, I assume I should subsample these from 30-50 million reads to ~10 million reads if I want to run these on a local desktop (8 cores, 32 GB RAM) vs, the cluster, which is having issues? Thank you!

My script for running this was the following: mitofinder --metaspades -j Cryn_370 -1 Cryn_370/6112-JK-12_16_S1_L005_R1_001.fastq.gz -2 Cryn_370/6112-JK-12_16_S1_L005_R2_001.fastq.gz -r /Volumes/BackupPlus/HP_mtDNAref.gb -o 2 -p 4 -m 22

I've tried running this on the cluster, but I keep getting the error of: -bash: /home/jmb689/ncbi-blast-2.12.0+/bin/makeblastdb: cannot execute binary file even though I have the / at the end of the word /bin/ (I did it as an installation for linux; our cluster couldn't install git-lfs).

Best, Justin

RemiAllio commented 2 years ago

Hi Justin,

It seems that metaspades is unable to pair your reads. Like the R1 and the R2 reads are not from the same sequencing run. I am pretty sure that is not due to the quality of your sample. To solve your issue you need to found out what is wrong with your input data. If everything is fine, something is wrong with metaspades: is the maximum memory reached? Did you try to run MEGAHIT?

You can subsample your data, but again, you must extract the same number of paired reads from R1 and R2 and not randomly subsample. In fact, if you have a read named "reads1" in file R1, you must have a read named "read1" in file R2.

Your script looks good :-)

If you want to run it on a linux cluster, I recommend you not to change the settings in MitoFinder.config. Here, it seems that the specified makeblastdb program is not compatible with your system. However, the blast binaries provided with MitoFinder should work.

There were many points on your previous post. I hope my answers help you, Best, Rémi

jbernst commented 2 years ago

Hi Rémi,

Thanks for getting back to me. So these raw files are ones that have been used for the Phyluce pipeline for UCEs, so as far as the input data goes, I haven't subsampled them yet. I haven't tried MEGAHIT yet, so I can try that next. I am wondering if it is a memory issue (especially with the large read size). Is there a recommendation for subsampling? 10 million reads total (5 million each?). And one last question: what do you mean that there must be "reads1" for file R1 and R2? Is it okay that I have mine as R1 for the -1 flag and R2 for the -2 flag? Or is this for subsampling?

For the cluster, it turns out that I had the wrong version of ncbi-blast+ installed (MAC OS, not linux), so I got them running. My config file is as follows:

###################################################
#               Version: 1.4                      #
###################################################

#folder for megahit
megahitfolder = /home/jmb689/miniconda3/envs/MitoFinder/bin/

#folder for Arwen
arwenfolder = /home/jmb689/MitoFinder/arwen/

#folder for Blast
blastfolder = /home/jmb689/ncbi-blast-2.12.0+/bin/

#folder for idba
idbafolder = /home/jmb689/idba/bin/

#folder for MetaSpades
metaspadesfolder = /home/jmb689/SPAdes-3.15.3-Linux/bin/

#folder for tRNAscan-SE
trnascanfolder = /home/jmb689/MitoFinder/trnascanSE/tRNAscan-SE-2.0/exec/bin/

#folder for MitFi
mitfifolder = /home/jmb689/MitoFinder/mitfi/

Thank you for the help!

Best, Justin

RemiAllio commented 2 years ago

Hi,

I was just saying that for paired reads, the names of the reads inside the files (R1 and R2) are (approximately) the same (modulo -2 /2 or .2 tags for example). For example, if in your R1 file you have a read named "@ESKKKK1000", you should have a read with a similar name in the R2 file too. That is an easy way to check if the two files (R1 and R2) are paired.

For subsampling, it depends on the sequencing method. For capture sequencing I would not recommend subsampling because mitochondrial reads are in low proportion. Then, for whole genome sequencing, 10 million reads total (5 million each) should be sufficient. If not, you can increase the number of reads and try again. It's difficult to anticipate ...

Sounds good for your cluster. Does MetaSPAdes work then? Have you run MitoFinder successfully in you cluster?

Best, Rémi

jbernst commented 2 years ago

Hi,

Okay, that makes sense. Is there a way to check the files? I assume using cat won't work due to the formatting of the fasqt.gz files. Would I need to unzip and then read them? It's odd that it thinks they aren't paired, as these are from the same run and worked for one of my other samples for MitoFinder.

I just got MitoFinder working in the cluster as of a few hours ago, so I can make an update here by next week (so I can run it on a few samples and see if I get the same error). I am using sequence capture, but they spiked the samples to increase the amount of mitochondrial bycatch, so the total number of reads per sample is between 30 and 50 million.

This was all using MetaSpades, but I am also going to try Megahit if I get the same errors again. I will report back. Thank you!

Best, Justin

RemiAllio commented 2 years ago

Hi,

Okay, I look forward to getting your feedback.

To simply "cat" gzipped files, you can use zcat :-)

Best, Rémi

jbernst commented 2 years ago

Hi Rémi,

So I ran some more analyses this weekend, this time with megahit. Megahit ran to completion on some, but for the others, I am not sure.

A few specimens megahit was able to find contigs and put the final results files into the ID_MitoFinder_megahit_mifi_Final_Results directory. That was for 4-5 specimens out of the 44. For the remainder, the log file said:

All results will be written here
Program folders:
MEGAHIT = /home/jmb689/miniconda3/envs/MitoFinder/bin/
Blast folder = /home/jmb689/ncbi-blast-2.12.0+/bin/
IDBA-UD folder = /home/jmb689/idba/bin/
MetaSPAdes folder = /home/jmb689/SPAdes-3.15.3-Linux/bin/
ARWEN folder = /home/jmb689/MitoFinder/arwen/
MiTFi folder = /home/jmb689/MitoFinder/mitfi/
tRNAscan-SE folder = /home/jmb689/MitoFinder/trnascanSE/tRNAscan-SE-2.0/exec/bin/

Starting Assembly step with MEGAHIT
Result files will be saved here:
/scratch/jmb689/MitoFinder_Homalopsids/Batch1/Bgas/Bgas_megahit/
Formatting database for mitochondrial contigs identification...
Running mitochondrial contigs identification step...
MitoFinder dit not found any mitochondrial contig

Despite this (and perhaps I am viewing this wrong), the Bgas_megahit directory lists the following:

Bgas_megahit.contigs.fa
Bgas_megahit.log
checkpoints.txt
done
intermediate_contigs/
options.json

With a large number of files in the intermediate contigs directory. Is this normal? Would this appear if megahit didn't complete on these? Or are these still products of a successful MitoFinder run that simply did not find any contigs. I can't say for certain, but I thought I would have had more mtDNA since these were enriched to capture more mtDNA.

I haven't tried Singularity version yet. I am happy to provide any files that would be needed for any further assistance.

Best, Justin

RemiAllio commented 2 years ago

Hi,

Sorry for the delay.

Okay, here it seems that everything worked but MitoFinder does not found a mitochondrial contig. Two possible explanations:

I hope this helps! Best, Rémi

jbernst commented 2 years ago

Thank you! I will check these out, and will also try a mapping strategy. The specimens that did work are actually ones that provide the most evolutionary information, so that's good. I will see what I can get from the rest :)

Best, Justin

RemiAllio commented 2 years ago

I am happy if MitoFinder helps you, even a little bit :-) Good luck with that! Best, Rémi