MicroB3-IS / osd-analysis

Repository for all Ocean Sampling Day related source code with information on how-to acquire OSD data
Apache License 2.0

OTU table 18S #22

Open ramonmassana opened 9 years ago

ramonmassana commented 9 years ago

Hi all, I need the reference sequences for the OTU table. The OTU table has 96,764 OTUs, so I need 96,764 reference sequences. Also, I am wondering whether a chimera check was run on this dataset. Even from a very superficial look at the sequences I could easily spot very obvious chimeras. For instance, sequence HWI-M02024:112:000000000-ACJ3F:1:1101:10006:19946 (corresponding to the sixth OTU in the OTU table) is a chimera between a copepod and an ascomycete. Thanks for your help, Ramon Massana

ikostadi commented 9 years ago

Hi Ramon,

judging by the number of OTUs you mention, I assume you are talking about the LGC 18S data. In any case, the following is true for all datasets processed with SILVAngs. In principle you have everything you need already, just not in the right form. I recommend you have a look at #14 before you go on; it might shed some light on the OTU mapping SILVAngs does and on what I call 'metaOTUs' in this context.

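The files you would be interested in are:

  * /stats/data/osd2014_18s---ssu---otu_mapping.stats - contains all ids (and other information) about the OTU mapping (what I started to call 'metaOTUs')
  * /exports/osd2014_18s---ssu---otus.csv - contains information on all OTUs
  * /exports/otu_references/ - contains FASTA files of all OTU references (separated by sample and classification)

What you need to do is: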

  1. Extract the list of metaOTU reference ids. You can do that by exporting the list either from the OTU table you mentioned (column 1) or from osd2014_18s---ssu---otu_mapping.stats (column 3). Example with bash/GNU tools:
     tail -n +2 osd2014_18s---ssu---otu_mapping.stats | cut -f 3 > meta_out_reference_ids.txt
  2. Generate the FASTA file you need. You can do that by filtering the FASTA files in the /exports/otu_references/ directory. Some scripting or an external tool will be involved. Quick and dirty example with GNU grep:
     grep -h -A 1 -F -f meta_out_reference_ids.txt /exports/otu_references/osd2014_18s---ssu---otu_references---OSD*.fna | grep -v -e '^--' > test.fna

Some toolkits like QIIME provide scripts to do that (e.g. filter_fasta.py)
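
The quick and dirty grep in step 2 relies on two things: `-A 1` assumes every sequence sits on a single line, and `-F -f` matches ids as substrings anywhere in a line, so an id that is a prefix of another id would pull in both records. A slightly more careful sketch with awk is shown below. It assumes that the reference id is the first whitespace-delimited token of each FASTA header (not confirmed for the SILVAngs exports), and the function name `filter_fasta_by_id` is made up for this illustration. It also prints each id only once, which is a design choice rather than something the pipeline prescribes.

```shell
# filter_fasta_by_id IDS_FILE FASTA [FASTA...]
# Hypothetical helper: prints every FASTA record whose id is listed in
# IDS_FILE (one id per line). Assumption: the id is the header text between
# ">" and the first whitespace. Handles sequences wrapped over several lines.
filter_fasta_by_id() {
  local ids_file=$1
  shift
  awk '
    # First argument is the id list: remember every non-empty id.
    FILENAME == ARGV[1] { if ($1 != "") wanted[$1] = 1; next }
    # Header line: keep the record if its id is wanted and not yet printed.
    /^>/ {
      id = substr($1, 2)
      keep = (id in wanted) && !(id in seen)
      seen[id] = 1
    }
    # Print the header and all sequence lines of a kept record.
    keep { print }
  ' "$ids_file" "$@"
}
```

With the file names used above, a call could look like `filter_fasta_by_id meta_out_reference_ids.txt /exports/otu_references/osd2014_18s---ssu---otu_references---OSD*.fna > test.fna`.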

Regarding your chimera question, SILVAngs does not perform chimera check. We had a very quick look at the sequence you mentioned but couldn't confirm your observation right away. Can you please send some more details on how you identified the chimera (looking at a tree, blasting against custom-curated in-house database, etc.)?

Hope this helps.

Best, Ivo

ramonmassana commented 9 years ago

Hi Ivo,

Yes, I am talking about the LGC 18S rDNA OTU table.

In most routines, you generate an OTU table together with a list of the representative sequences of each OTU. There are different ways to select the "representative sequence" from the pool of very similar sequences included in each OTU. I guess the most widely used is to select the most common sequence.

So, you should provide this list of reference sequences in order to make people's lives much easier. I understand that we have the data to build it ourselves, but it would be much easier for you to do it, and everybody would end up with the same list of reference sequences.

Second, it is fundamental to process your reference sequences through a chimera check routine. Chimeras do occur, are very frequent, and account for a large number of OTUs (generally at low abundance). They can be easily removed with several programs. So why not remove them?

Regarding the chimera from my previous message, HWI-M02024:112:000000000-ACJ3F:1:1101:10006:19946 (corresponding to the sixth OTU in the OTU table), I know that it is a chimera just from the results of a BLAST search:

  a. The closest match is only 91%.
  b. When you look at the sequence alignment, it is very conserved at the beginning (which is the most variable region) and variable at the end (which is the most conserved region). This does not make sense.
  c. The first part of the sequence (1-180) is Temora turbinata, a copepod (metazoan).
  d. The second part of the sequence (181-380) is a 100% match to Hypocreales sp., an ascomycete fungus.
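
A BLAST result like this can be double-checked by cutting the read at the suspected breakpoint and searching each segment separately. The sketch below does the cutting only. The function name `split_fasta_at` and the segment naming scheme are invented for this illustration, and the breakpoint (180 in the case above) has to be supplied by hand; this is not a chimera detection tool.

```shell
# split_fasta_at FASTA BREAKPOINT
# Hypothetical helper: writes every sequence in FASTA as two records,
# positions 1..BREAKPOINT and BREAKPOINT+1..end (1-based), so that each
# segment can be searched on its own. Wrapped sequences are joined first.
split_fasta_at() {
  local fasta=$1 breakpoint=$2
  awk -v bp="$breakpoint" '
    BEGIN { bp += 0 }                       # force numeric breakpoint
    function emit(   n) {
      n = length(seq)
      if (id == "" || n == 0) return
      printf ">%s_1-%d\n%s\n", id, (n < bp ? n : bp), substr(seq, 1, bp)
      if (n > bp)
        printf ">%s_%d-%d\n%s\n", id, bp + 1, n, substr(seq, bp + 1)
    }
    /^>/ { emit(); id = substr($1, 2); seq = ""; next }
    { gsub(/[ \t\r]/, ""); seq = seq $0 }   # join wrapped sequence lines
    END { emit() }
  ' "$fasta"
}
```

For the read discussed here, `split_fasta_at suspect.fna 180 > suspect_segments.fna` would produce the two segments described in points c and d, where `suspect.fna` is a placeholder for a file holding that one sequence.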

Here, Ramiro processed your reads to produce an OTU table with reference sequences and without chimeras. After removing singletons, there are 18,503 OTUs, a number much lower than in your table (50,058).
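
How the singletons were removed is not described in this thread. As a rough illustration of the counting step only, the sketch below assumes a tab-separated OTU table with one header row, the OTU id in column 1 and per-sample read counts in the remaining columns, and it treats a singleton as an OTU whose counts sum to exactly 1 across all samples. The function name `count_non_singleton_otus` is made up for the example.

```shell
# count_non_singleton_otus OTU_TABLE
# Hypothetical helper: counts OTUs whose total abundance across all samples
# is greater than 1. Assumptions: tab-separated table, one header row,
# OTU id in column 1, read counts in columns 2..N.
count_non_singleton_otus() {
  awk -F '\t' '
    NR == 1 { next }                        # skip the header row
    {
      total = 0
      for (i = 2; i <= NF; i++) total += $i
      if (total > 1) n++                    # drops singletons and empty OTUs
    }
    END { print n + 0 }
  ' "$1"
}
```

Running the same function on two OTU tables built from the same reads gives a like-for-like comparison of OTU counts after singleton removal.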

Best regards,

Ramon



Ramon Massana i Molera Institut de Ciències del Mar, CSIC Passeig Marítim de la Barceloneta, 37-49 08003 Barcelona, Catalonia, Spain E-mail: ramonm@icm.csic.es Phone: 34-93-2309500. Direct phone: 34-93-2309599 Fax: 34-93-2309555 http://www.icm.csic.es/bio/projects/icmicrobis/massana


ikostadi commented 9 years ago

Hi Ramon,

thank you for your feedback.

The SILVAngs pipeline does not produce an OTU-by-sample table by default. The OTU table was requested at the OSD Analysis Workshop in March. Both the OSD Team and the SILVA Team have invested a considerable amount of time in providing these tables as part of the result packages. We will consider providing a FASTA file of the reference sequences from the OTU tables in the future, as per your request. At the moment our efforts are focused on the OSD 2015 datasets. In the meantime, the method I outlined above does guarantee that everyone ends up with the same FASTA file (the reference sequence is already selected by the pipeline). One minor addition: in step 2, please extract the sequences from the FASTA files and not from the CSV file!

SILVAngs does not check for chimeras. The reason is that, after extensive testing and comparisons, no tool seems to deliver reliable results. However, the coverage of a query sequence is considered during its classification. If a sequence is chimeric, its alignment coverage is likely to cause the sequence to be classified as 'No Relative'. This is only an indication, and chimeras may well be classified like other sequences, especially if the chimera is formed of two sequences from the same species or genus. As is obvious from your case, some chimeras do slip through.

Your analysis is a valuable contribution to the OSD community. If you wish to discuss the chimera topic at length and compare approaches with the OSD analysis team, I would be happy to put you in touch.

Best, Ivo

ikostadi commented 9 years ago

Dear Ramon, since I haven't heard back from you in a while I assume you are satisfied with the answers and I am closing the issue. Best, Ivo

ramonmassana commented 9 years ago

Hi Ivo,

Yes, I have not replied to your last message. I thought it was from a couple of weeks ago, but now I realize that it was from a month ago!

Sorry for the delay, I have been very busy these days.

Coming to the OSD data, I still think it would be better if you created the reference sequences and, most importantly, ran standard chimera check programs. As it is, the output available from the OSD project on eukaryotes is very hard to work with.

And, as I said before, Ramiro processed your reads to produce an OTU table with reference sequences and without chimeras. After removing singletons, there are 18,503 OTUs, a number much lower than in your table (50,058). We are glad to share this OTU table with the eukaryotic consortium.

Best regards

Ramon
