cmayer / MitoGeneExtractor

The MitoGeneExtractor can be used to extract protein coding mitochondrial genes, such as COI and others from short and long read sequencing libraries.
GNU Affero General Public License v3.0
6 stars 2 forks source link

Problem running MitoGeneExtractor with 2 input reads #6

Closed AlbertoPiris closed 5 months ago

AlbertoPiris commented 1 year ago

Good afternoon,

I am trying to analyze ultraconserved elements (UCEs) data with MitoGeneFinder. I have paired-end reads data (one R1 and R2 per sample) and I am trying to analyze only one sample as a test. I have followed the steps to generate the COI sequence for the taxon of interest. Subsequently, when running MitoGeneExtractor I get a problem. This is the code I used to run the analysis:

../../software/anaconda3/envs/mitogeneextractor/MitoGeneExtractor/MitoGeneExtractor/MitoGeneExtractor-v1.9.5 -q Ad_1_Gabiarra_PI-READ1.fastq Ad_1_Gabiarra_PI-READ2.fastq -p Alytes_COI_consensus.fasta -e ../.... /software/exonerate/exonerate-2.2.0-x86_64/bin/exonerate -V vulgar_Ad_1_Gabiarra_PI_Alytes_COI.txt -o Ad_1_Gabiarra_PI_align_Alytes_COI.fas -n 0 -c Ad_1_Gabiarra_PI_cons_Alytes_COI.fas -t 0.5 -r 1 -C 2

Subsequently, the terminal returns the following error: Welcome to the MitoGeneExtractor program, version 1.9.5 PARSE ERROR: Argument: Ad_1_Gabiarra_PI-READ2.fastq Couldn't find match for argument

I have also tried to concatenate both reads into one, and it seems that the analysis starts to run, but then returns another error: ERROR: The sequence names in the DNA read file are not unique when trimming them at the first space, which is done by many programs, including exonerate. Please verify your input data. Often it helps to rename sequences by replacing spaces with e.g. underscores.

What could I do to try to fix them? Any help would be appreciated, thank you very much!

marievalerie commented 1 year ago

Hi,

there are several small issues here, but I am sure we can fix them :)

The first problem is that for the newer version of MGE you have to specify the -q parameter twice, so in your case it would be ... MitoGeneExtractor-v1.9.5 -q Ad_1_Gabiarra_PI-READ1.fastq -q Ad_1_Gabiarra_PI-READ2.fastq -p Alytes_COI_consensus.fasta ... I saw that this is also wrong in the description here, sorry for that. I will fix it.

However, this will only lead to the error that you encountered when you've concatenated the files. This error is raised because your read names in the fastq file are not unique. This can happen, depending on the sequencing protocol and device.

Maybe we are lucky and this is only related to whitespace issues (as indicated by the error message). If you are using a linux machine, you could replace all whitespaces with e.g. underscores by typing sed 's, ,_,g' -i file.fastq (this will change the file in place! if you don't want that, then type sed 's, ,_,g' file.fastq > new_file.fastq ).

Do that for both files and try to run MGE again with the modified files. If it works, it should be fine. But it is likely that you still don't have unique read names. If you are unsure, you can also drop me the first lines of your read file via pasting the output of head Ad_1_Gabiarra_PI-READ1.fastq Ad_1_Gabiarra_PI-READ2.fastq.

In that case we would need to add a unique ID to the read name, but this should be straightforward. First try it with the whitespace and then let me know if we need to tackle that further.

Best wishes, Marie

AlbertoPiris commented 1 year ago

Hi!

It worked! Thank you very much for the help. The commands for fixing the whitespace issue and then running MGE worked, so the program gave me back the consensus sequence.

But now I've tried with some new sequences from another samples, and the software runs but then the consensus sequence is blank. But as the first sequence worked, I suppose this must be a problem with the new sequences themself, or maybe I haven't generated the COI sequence correctly and it’s too specific. Any idea? Have you had any problem like this, that the retrieved consensus sequence is totally blank?

Another option that could be possible is that, as UCEs work better with amniotes than with amphibians, and these samples are amphibian sequences, maybe the mitochondrial genes are not entirely recovered. I will also try with some reptiles sequences to see if there is any difference, and the problem could be related with this. Thank you so much!

marievalerie commented 1 year ago

Hi, glad it worked.

When none of your input reads align to your reference, your output will be empty. Or more specifically, you should have the fasta header in your consensus, but then no sequence.

My first guess is that this is a physical problem i.e., you just don't have reads in your input library that originated from COI. In the MGE paper, we had that for ~15 COI sequences from roughly 2000 sequencing libraries so it's rather rare, but it can happen. Absent data can't be fixed, unfortunately. Did you specifically enriched mitochondrial loci in this library or do you expect them to be a byproduct?

Whether your reference is too specific depends on the taxonomic spectrum you want to analyze and the data you used to create the reference. I guess you are working with the genus Alytes, since you named your reference like that as well as your read data. If all the read data you want to analyze is from that genus and if your reference was constructed based on that genus, it should be fine. MGE is more sensitive, the better the reference fits the taxonomic level of your input data.

But if you analyze a wider taxonomic range, then the reference might be too specific and the reads will not align anymore to the reference. This can also happen if you reconstructed your reference with sequences from NCBI, but at NCBI one taxon is highly overrepresented und you constructed your reference effectively based on one (or only a few) species.

I deposited some references for major taxonomic groups here at GitHub. https://github.com/cmayer/MitoGeneExtractor/tree/main/Amino-Acid-references-for-taxonomic-groups/COI-references-for-different-taxonomic-groups

If you want, try the reference for vertebrates or amphibians to see if you can get better results with a more general reference. If that works, then double check that your output consensus COI starts and ends at the right position. I recall that in frogs, COI can be somehow different from other vertebrates, could be that you have in the beginning some indels or something like that.

AlbertoPiris commented 1 year ago

Hi,

I have tried different ways. For the same 8 samples (all supposed to be Alytes maurus), I tried with these three references: Alytes (created by me with NCBI sequences), amphibian and vertebrates (given by you). Some worked and another ones didn't. However, the retrieved sequences are too short, and dont get a lot of the CDS.

 - For the Alytes, I only got one COI consensus sequence. I copied and pasted it in Blast, and it gave me back 100% of Alytes maurus (which is supposed to be correct).
 - For the amphibian reference, I got 4 consensus out of the 8 samples. Of those, 3 were identified as Alytes (but not maurus, so the species was failed) and the other one was identified as another genera (I don't know if it could be contamination or any other problem).
 - For the vertebrate reference, I got the same as for the amphibia one. 

Taking this into account, maybe the problem could be generated because of a physical problem with the sequences, as you said. These are UCE libraries, and at first we didn't intend to recover the mitochondrial, we knew later that some of it could be recovered as a byproduct. So I think that depends on several factors (how good each sequences is, the reads it has, the taxon...).

marievalerie commented 1 year ago

Based on what you say here, I think you had one library with a sufficiently high COI coverage (the one that worked with the genus specific ref). The other 3 which only worked with more general references might either show higher genetic distance to your Alytes reference, or you included more reads that are deviating from the 'true' (i.e., species or even individual specific) COI sequence by using a more general reference. These reads could originate from NUMTs or contamination and especially when you have a very gappy consensus sequences, it is possible that not the real/right COI locus contributed these reads. If you want to explore that in more detail, check the alignment file generated by MGE. If you want me to have a look, you could send me the files (alignment and consensus) or better a link to them privately at m.brasseur@leibniz-lib.de

However, when no reads can be pulled out of the library, there will be no consensus. As you said, several factors could add up why this can happen.

Best, Marie

cmayer commented 1 year ago

In the case of UCE libraries, MGE basically mines the off target sequences. For some libraries this works better than for others. In the case of a better enrichment, the off target reads are fewer.

A perfect usecase of MGE is to indentify contaminated libraries. So if you find something that differs from what you expect, it could be a contamination in the library. This can happen in your lab but also in the lab of the sequencing company. A few years ago we had big problems with a sequencing company that tagged our libraries with bad tags. Also they used single tagging. In the end we had considerable cross-talking in the libraries. If you extract genes of interest and if they are more similar than expected, to something else, this would be another hint in this direction. So if MGE finds reads from a different species in considerable quanties, you should look at you data carefully.