Lafond-LapalmeJ / MCSC_Decontamination

Scripts and documentation on the MCSC decontamination method
GNU General Public License v3.0

About Plantae kingdom #13

Closed (wyim-pgl closed this issue 7 years ago)

wyim-pgl commented 7 years ago

Dear Lafond-LapalmeJ,

Hello!

This looks so nice to me.

Is there any way to run it with the options below?

Thank you.

$TAXO_LVL: taxonomic level for the WR index (default: phylum)

TAXO_LVL="kingdom"

$WHITE_NAME: Name of the target taxon for the WR index (REQUIRED)

WHITE_NAME="Plantae"

Lafond-LapalmeJ commented 7 years ago

Of course you can. Just set TAXO_LVL="kingdom" and WHITE_NAME="Plantae" in your .ini file. I have never tested it, but it should work. Don't hesitate to ask if you need help running it.

wyim-pgl commented 7 years ago

Thank you.

However, I got this error with "kingdom" and "Plantae".

Can you please check this?

Reported 125 pairwise alignments, 125 HSSPs.
5 queries aligned.
Extracting DIAMOND blast taxonomy...
Formating the DIAMOND output...
Running the MCSC algorithm...
Computing the White-Ratio (WR) index and evaluating the clusters...
The kingdom to keep is Plantae
Top kingdom in /home/wyim/scratch/bin/MCSC_Decontamination/out_test_new/taxo_uniq.txt : Chlorobionta
Use of uninitialized value $taxa2 in concatenation (.) or string at /home/wyim/scratch/bin/MCSC_Decontamination/scripts/cluster_eval.pl line 117.
Use of uninitialized value $taxa3 in concatenation (.) or string at /home/wyim/scratch/bin/MCSC_Decontamination/scripts/cluster_eval.pl line 117.
Use of uninitialized value $taxa4 in concatenation (.) or string at /home/wyim/scratch/bin/MCSC_Decontamination/scripts/cluster_eval.pl line 117.
Use of uninitialized value $taxa5 in concatenation (.) or string at /home/wyim/scratch/bin/MCSC_Decontamination/scripts/cluster_eval.pl line 117.
Use of uninitialized value $taxa2 in regexp compilation at /home/wyim/scratch/bin/MCSC_Decontamination/scripts/cluster_eval.pl line 146.
Use of uninitialized value $taxa3 in regexp compilation at /home/wyim/scratch/bin/MCSC_Decontamination/scripts/cluster_eval.pl line 149.
Use of uninitialized value $taxa4 in regexp compilation at /home/wyim/scratch/bin/MCSC_Decontamination/scripts/cluster_eval.pl line 152.
Use of uninitialized value $taxa5 in regexp compilation at /home/wyim/scratch/bin/MCSC_Decontamination/scripts/cluster_eval.pl line 155.
Use of uninitialized value $taxa2 in regexp compilation at /home/wyim/scratch/bin/MCSC_Decontamination/scripts/cluster_eval.pl line 146.
Use of uninitialized value $taxa3 in regexp compilation at /home/wyim/scratch/bin/MCSC_Decontamination/scripts/cluster_eval.pl line 149.
Use of uninitialized value $taxa4 in regexp compilation at /home/wyim/scratch/bin/MCSC_Decontamination/scripts/cluster_eval.pl line 152.
Use of uninitialized value $taxa5 in regexp compilation at /home/wyim/scratch/bin/MCSC_Decontamination/scripts/cluster_eval.pl line 155.
The kingdom Plantae is not present in taxo_uniq.txt file. Aborting.

Lafond-LapalmeJ commented 7 years ago

From this I see that DIAMOND blast reported 125 alignments. All of them are assigned to the kingdom Chlorobionta (Viridiplantae, or green plants). The NCBI taxonomy dump seems to use the kingdom Chlorobionta for your sequences. I know that Plantae is a kingdom, but there is no Plantae kingdom in those NCBI taxonomy files.

How many sequences do you have? What is their mean length? All your blast results are assigned to Chlorobionta. Are you sure you have contaminant sequences? If your contaminant sequences sit inside a contig (assembly chimeras), the decontamination method that we propose for now can't identify them, because it does not detect partial contaminant sequences caused by misassemblies. This is why the method is better suited to transcriptome decontamination. If that is the case, you might need to split those assembly chimeras before running the MCSC decontamination.
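The mismatch described above (WHITE_NAME set to a name that never appears in taxo_uniq.txt) can be checked before launching a full run. The snippet below is only a rough sketch: the layout of taxo_uniq.txt is not shown in this thread, so it simply counts lines that mention each candidate name as a whole word. The function name `count_taxon_lines` and the sample lines are hypothetical, not part of MCSC.

```python
import re
from collections import Counter


def count_taxon_lines(lines, candidates):
    """Count lines mentioning each candidate taxon as a whole word (case-insensitive)."""
    counts = Counter()
    for line in lines:
        for name in candidates:
            pattern = r"\b" + re.escape(name) + r"\b"
            if re.search(pattern, line, flags=re.IGNORECASE):
                counts[name] += 1
    return counts


# Hypothetical lines standing in for taxo_uniq.txt; the real layout may differ.
sample = [
    "contig_1\tChlorobionta\tStreptophyta",
    "contig_2\tChlorobionta\tStreptophyta",
]

counts = count_taxon_lines(sample, ["Plantae", "Chlorobionta"])
for name in ("Plantae", "Chlorobionta"):
    print(name, counts[name])
```

On a real run, the same function could be fed the lines of the taxo_uniq.txt file from the output directory. A count of zero for the chosen WHITE_NAME would match the "not present in taxo_uniq.txt" abort shown in the log.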

wyim-pgl commented 7 years ago

I just did a test run with a small set.

With the whole transcript assembly set, it works like a charm. Very nice.

Can I use it for organellar genome decontamination?

For example, mitochondria and chloroplasts?

Lafond-LapalmeJ commented 7 years ago

Good to hear that it works. I have never tried to identify organellar sequences with the MCSC decontamination method. If you have clean data (no other contaminants), it could work. The clustering part can be done by the MCSC, but the cluster labeling won't work because all the sequences come from the same species. You will have to compute an organellar % index for each cluster manually.
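One possible way to compute such an organellar % index by hand is sketched below. It assumes you already have two things that MCSC is not confirmed to provide in this form: a mapping from sequence ID to cluster label, and a set of sequence IDs that matched organellar references (for example from a separate alignment against mitochondrial and chloroplast genomes). All names here are hypothetical.

```python
from collections import defaultdict


def organellar_percent(cluster_of, organellar_ids):
    """Return, for each cluster, the percentage of its sequences flagged as organellar."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for seq_id, cluster in cluster_of.items():
        totals[cluster] += 1
        if seq_id in organellar_ids:
            hits[cluster] += 1
    return {cluster: 100.0 * hits[cluster] / totals[cluster] for cluster in totals}


# Hypothetical inputs: sequence -> cluster label, and the IDs with organellar hits.
cluster_of = {"seq1": "A", "seq2": "A", "seq3": "B", "seq4": "B"}
organellar_ids = {"seq1", "seq3", "seq4"}

result = organellar_percent(cluster_of, organellar_ids)
for cluster in sorted(result):
    print(cluster, result[cluster])
```

Clusters with a high percentage would be the candidates for organellar sequence. The cutoff is left to the user, since the thread gives no threshold.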

Keep me updated if you try this.