RemiAllio / MitoFinder

MitoFinder: efficient automated large-scale extraction of mitogenomic data from high throughput sequencing data

problem_in_poducing_the_[Seq_ID]_final_genes_NT.fasta #6

Closed Niloofar-Alaei closed 4 years ago

Niloofar-Alaei commented 4 years ago

Hi,

Thanks a lot for this useful program. I have two questions:

  1. I am using MitoFinder for 40 samples. For 7 of them, MitoFinder does not produce the final result file "[Seq_ID]_final_genes_NT.fasta" and instead provides the file "[SeqID]_mtDNA_contig_genes_NT.fasta". In addition, the end of the log files for these 7 samples only says "… genes were found in mtDNA_contig" and does not mention the contig names and so on.

Would you please let me know what the problem is and how I can solve it?

  2. At the end of the log file for some of my samples, I get a warning message: "genes were found more than once suggesting either fragmentation, NUMT annotations, or potential contamination of your sequencing data. Different contigs may be part of different organisms." Some of these samples have high coverage, and I am sure they do not contain any contamination.

Can I overlook the warning message?

Please let me know if you have any ideas about my issues.

Cheers Niloo

RemiAllio commented 4 years ago

Hi there,

Thank you for your feedback!

  1. MitoFinder prints different messages depending on how many contigs it finds. I think that in 7 of your 40 samples, MitoFinder found only one contig matching the reference(s). In that case, the files [Seq_ID]_final_genes_NT.fasta and [Seq_ID]_mtDNA_contig_genes_NT.fasta would be the same, because [Seq_ID]_final_genes_NT.fasta is created by selecting, for each gene found in several contigs, the longest copy. So if only one contig is found, there is no need to create this file. Does that make sense? Could you please confirm that this happened when MitoFinder found a single contig matching the reference?

  2. This warning message is general and is only there to alert the user. I think you can overlook it if you are sure about the content of your data and/or if the complete mitochondrial contig (containing all genes and having the expected size) is found among the contigs matching the reference. For example, if you have something like this:

15 genes were found in mtDNA_contig_1
1 gene was found in mtDNA_contig_2
0 gene was found in mtDNA_contig_3

then the mitochondrial contig for this species is likely contig 1, and you can use this contig and its associated files for further analyses.

However, as general advice, you should check the genes that are found several times. Here, the gene found in contig 2 is either a fragment of a gene that is incomplete in contig 1, a contaminant, or a NUMT.
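As a rough illustration of the check described above, here is a minimal Python sketch that reads the summary lines from a log and reports which contig carries the most genes. The line wording is copied from the excerpts in this thread; the function names are made up for the example, and other MitoFinder versions may phrase the summary differently.

```python
import re

# Matches summary lines such as "15 genes were found in mtDNA_contig_1".
# The wording comes from the log excerpts quoted in this thread (assumption:
# your MitoFinder version prints the same sentences).
SUMMARY_RE = re.compile(
    r"(\d+) genes? (?:were|was) found in (mtDNA_contig(?:_\d+)?)"
)

def genes_per_contig(log_text):
    """Return {contig_name: gene_count} for every summary line in the log."""
    return {name: int(count) for count, name in SUMMARY_RE.findall(log_text)}

def best_contig(counts):
    """Return the contig with the most annotated genes, or None if empty."""
    if not counts:
        return None
    return max(counts, key=counts.get)

log = """15 genes were found in mtDNA_contig_1
1 gene was found in mtDNA_contig_2
0 gene was found in mtDNA_contig_3"""

counts = genes_per_contig(log)
print(counts)
print(best_contig(counts))
```

A contig that holds nearly all genes is a good candidate for the mitochondrial genome; any other contig with a non-zero count is worth a closer look.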

Note: The default parameters allow MitoFinder to find genes in relatively distant species. If you have a closely related reference, you can change the BLAST parameters to limit false-positive annotations (--blast-identity-nucl; --blast-identity-prot; --blast-size).

I hope this answer will help you. Cheers, Rémi

RemiAllio commented 4 years ago

Hi !

Well, for point 1, I double-checked: MitoFinder can find several contigs and still not create a file named [Seq_ID]_final_genes_NT.fasta. This happens when only one contig contains mitochondrial genes.

For example:

15 genes were found in mtDNA_contig_1
0 gene was found in mtDNA_contig_2

In that case, you don't get the file because only contig 1 contains genes. So in the current version (v1.2) the "final" genes are in the file [Seq_ID]_mtDNA_contig_1_genes_NT.fasta.

To make it easier, in the next version (v1.3, coming soon), the file containing the "final" genes will always be created!

Thank you for your comment! Cheers, Rémi

Niloofar-Alaei commented 4 years ago

Hi, thanks for your response. Yes, MitoFinder found a single contig matching the references for these seven samples.

Cheers Niloo

RemiAllio commented 4 years ago

Hi,

You're welcome :-)

Cheers, Rémi

Niloofar-Alaei commented 4 years ago

Hi,

For this case, I double-checked all the samples just to be sure, and I found the following:

A) In some samples only contig 1 contains genes, but the end of the log file says, as you described:

15 genes were found in mtDNA_contig_1
0 gene was found in mtDNA_contig_2

In the final result folder there are files for both contigs 1 and 2, but the two contig 2 files (contig_2_genes_AA.fasta and NT.fasta) are empty, and there is a _final_genes_NT.fasta file too.

B) For these 7 samples, however, the end of the log file only says:

15 genes were found in mtDNA_contig

Also, I cannot run a multiple alignment of these 7 samples together with the rest of the samples using MAFFT; these 7 samples do not align with the rest. I have to align them gene by gene, and I don't know why!

I wanted to let you know and to ask what the difference is between the two cases I mention (A and B).

Cheers Niloo

RemiAllio commented 4 years ago

Hi,

The difference between A and B is that MitoFinder found either several contigs matching the reference (A) or only a single contig matching the reference (B).

In the first case, despite several contigs matching the reference, mitochondrial genes are found only in contig 1. So the file containing the mitochondrial genes is (in v1.2) [SeqID]_mtDNA_contig_1_genes_NT.fasta. In the second case, a single contig matches the reference, so the file containing the mitochondrial genes is (in v1.2) [SeqID]_mtDNA_contig_genes_NT.fasta.
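To handle both cases with one script, something like the following sketch could pick whichever gene file exists for a sample. It assumes all output files sit in one flat folder and that, when only one numbered contig has genes, it is contig 1 as in the examples in this thread; `find_gene_file` is a hypothetical helper, not part of MitoFinder.

```python
from pathlib import Path
import tempfile

def find_gene_file(result_dir, seq_id, suffix="NT"):
    """Return the first existing, non-empty gene file for a sample, or None.

    Candidates are tried in this order (file names as given in this thread):
      1. <seq_id>_final_genes_<suffix>.fasta
      2. <seq_id>_mtDNA_contig_genes_<suffix>.fasta    (single contig matched)
      3. <seq_id>_mtDNA_contig_1_genes_<suffix>.fasta  (assumes contig 1 has the genes)
    """
    result_dir = Path(result_dir)
    candidates = [
        f"{seq_id}_final_genes_{suffix}.fasta",
        f"{seq_id}_mtDNA_contig_genes_{suffix}.fasta",
        f"{seq_id}_mtDNA_contig_1_genes_{suffix}.fasta",
    ]
    for name in candidates:
        path = result_dir / name
        # Skip empty files, e.g. the empty contig_2 files mentioned above.
        if path.is_file() and path.stat().st_size > 0:
            return path
    return None

# Small demonstration with a throwaway folder and a made-up sample ID.
with tempfile.TemporaryDirectory() as folder:
    demo = Path(folder) / "sampleA_mtDNA_contig_genes_NT.fasta"
    demo.write_text(">sampleA@COX1\nATGC\n")
    found = find_gene_file(folder, "sampleA")
    print(found.name if found else "no gene file found")
```

Checking the log summary before trusting option 3 is still a good idea, since the contig with the genes is not guaranteed to be number 1.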

Then, if you want to do multiple alignments of the mitochondrial genes (that's what you want to do, right?), you have to extract each gene (e.g. COX1, then COX2, ...) from these two kinds of files and align each gene separately, with MAFFT for example.
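The gene-by-gene extraction could be sketched in Python as below. This is only an illustration: it assumes each FASTA header ends with the gene name after an "@" (for example ">sampleA@COX1"), so check your own files and adapt `gene_name` if they differ. All function and sample names here are hypothetical.

```python
from collections import defaultdict

def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def gene_name(header):
    """Assumed header layout '<sample>@<gene>'; adjust to your files."""
    return header.split("@")[-1].strip()

def group_by_gene(samples):
    """samples: {sample_name: fasta_text} -> {gene: {sample_name: sequence}}."""
    per_gene = defaultdict(dict)
    for sample, text in samples.items():
        for header, seq in read_fasta(text):
            per_gene[gene_name(header)][sample] = seq
    return per_gene

def to_fasta(records):
    """Format {name: sequence} as FASTA text, one record per sample."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())

# Toy input standing in for the per-sample gene files.
samples = {
    "sampleA": ">sampleA@COX1\nATGC\nGG\n>sampleA@COX2\nATGA\n",
    "sampleB": ">sampleB@COX1\nATGCTT\n",
}
per_gene = group_by_gene(samples)
for gene, records in per_gene.items():
    print(gene)
    print(to_fasta(records), end="")
```

Each per-gene block can be written to its own file (say COX1.fasta) and aligned separately, for instance with `mafft --auto COX1.fasta > COX1_aligned.fasta`.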

Does that help? I hope it does, because it is difficult for me to fully understand without seeing the files...

(Version 1.3 is now online and you can get it by using git clone https://github.com/RemiAllio/MitoFinder.git)

Cheers, Rémi

Niloofar-Alaei commented 4 years ago

Hi

Yes, now it's completely clear to me.

Yes, I want to do the multiple alignment, and I am doing it the way you described.

Thanks a lot

Niloo


Niloofar-Alaei commented 4 years ago

Hi dear Rémi,

First of all, sorry for asking so many questions, and thank you for taking the time to answer me.

I would like to know how MitoFinder chooses the final genes (final_genes_NT.fasta).

Looking at the MitoFinder output files, I see two situations. For example, in this case:

15 genes were found in mtDNA_contig_1
2 genes were found in mtDNA_contig_2 (e.g. including the COX2 and ATP8 genes)
0 gene was found in mtDNA_contig_3

A) the COX2 and ATP8 sequences in the final file are the same as in contig 1, which contains all the genes;

B) in some other samples, however, the COX2 and ATP8 sequences in the final file are the same as in contig 2.

According to the log file, "Mitofinder selected the longest sequence as the final sequence". I looked at all the files associated with these two contigs: these genes have the same length in both, and honestly I cannot find any clue as to how MitoFinder chooses the genes for the final file.

In such a case, can I use contig_1_genes_NT.fasta for the rest of the analyses, or should I use final_genes_NT.fasta? It is important for me to know how MitoFinder chooses the genes for final_genes_NT.fasta.

Cheers Niloo


RemiAllio commented 4 years ago

Hi!

Don't worry, it's good for me to get feedback!

Well, as you have already seen, MitoFinder chooses the longest copy of each gene as the "final" gene. When two copies have exactly the same size, one of them is chosen (randomly); I will change that. However, finding two copies with exactly the same size (probably complete genes) suggests either contamination (annotation of a complete gene from a contaminant) or a NUMT annotation (a mitochondrial gene inserted in a nuclear contig). Do you see any difference between the two copies (e.g. of COX2) found in mtDNA_contig_1 and mtDNA_contig_2?
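For readers who want to reproduce this selection on their own files, here is a minimal Python sketch of a "longest copy wins" rule that also flags the suspicious case of equal-length copies with different sequences. It is not MitoFinder's code: the function name and data layout are invented for the example, and ties are resolved by keeping the first contig rather than a random one.

```python
def pick_final_genes(contigs):
    """Pick the longest copy of each gene across contigs.

    contigs: {contig_name: {gene: sequence}}, in contig order.
    Returns (final, ties): final maps gene -> (contig_name, sequence);
    ties lists genes whose longest copies have the same length but
    different sequences (possible contamination or NUMT).
    """
    copies = {}
    for contig, genes in contigs.items():
        for gene, seq in genes.items():
            copies.setdefault(gene, []).append((contig, seq))

    final, ties = {}, []
    for gene, found in copies.items():
        longest = max(len(seq) for _, seq in found)
        top = [(contig, seq) for contig, seq in found if len(seq) == longest]
        final[gene] = top[0]  # first contig wins a tie (deterministic choice)
        if len({seq for _, seq in top}) > 1:
            ties.append(gene)
    return final, ties

# Toy sequences, far shorter than real genes.
contigs = {
    "mtDNA_contig_1": {"COX1": "ATGTTCGGA", "COX2": "ATGGCA", "ATP8": "ATGCCA"},
    "mtDNA_contig_2": {"COX2": "ATGGCT", "ATP8": "ATG"},
}
final, ties = pick_final_genes(contigs)
for gene, (contig, seq) in final.items():
    print(gene, contig, len(seq))
print("equal-length copies that differ:", ties)
```

Genes reported in `ties` are the ones worth comparing by eye or by BLAST before deciding which copy to keep.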

In your case, I recommend using the contig containing all the mitochondrial genes and its associated files for further analyses: [SeqID]_mtDNA_contig_1_genes_NT.fasta and [SeqID]_mtDNA_contig_1_genes_AA.fasta. This contig is likely the (nearly) complete mitochondrial genome of your focal species.

Cheers, Rémi

Niloofar-Alaei commented 4 years ago

Hi, thanks for your response.

Yes, in this case both COX2 and ATP8 differ between contig 1 and contig 2.

Cheers Niloo

RemiAllio commented 4 years ago

Hi,

Well, I recommend using the genes from the contig that contains all the other genes.

Then, I recommend that you BLAST the two other genes (from contig 2) on NCBI to see whether they come from contamination or from a NUMT.

Cheers, Rémi

Niloofar-Alaei commented 4 years ago

Hi,

Thanks a lot for all your help

Cheers Niloo