Species distribution dataset

clairedavies commented 1 year ago

Please add a data table to the repo that includes the abundance, count or read data for Trichodesmium (any species), Noctiluca scintillans, and any Chattonella, Heterosigma and Pseudochattonella species (fish-killing HABs). Include only data from the NRS stations; I will need the NRS trip code and a sample depth to match to metadata.

The plan is to plot a time series of 'relative abundance' for each taxon, with a seasonal climatology, and maybe a map of relative abundances.
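A minimal sketch of the seasonal climatology idea, assuming it means averaging relative abundance by calendar month across all years. The function name and the example dates and values are hypothetical; sample dates would come from the NRS metadata matched on trip code.

```python
from collections import defaultdict
from datetime import date


def monthly_climatology(observations):
    """Average values by calendar month, pooling all years together.

    observations: iterable of (date, relative_abundance) pairs.
    Returns {month_number: mean_value} for months that have data.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, value in observations:
        sums[day.month] += value
        counts[day.month] += 1
    return {month: sums[month] / counts[month] for month in sorted(sums)}


# Made-up observations, for illustration only.
obs = [
    (date(2019, 1, 15), 0.02),
    (date(2020, 1, 20), 0.04),
    (date(2019, 7, 3), 0.00),
]
clim = monthly_climatology(obs)
```

Each time series point can then be compared against the mean for its month.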

If the data are not rarefied, I will also need the total number of reads per sample, along with the reads for each taxon.
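As a rough sketch of why the total is needed: relative abundance here is taken to mean a taxon's reads divided by the sample's total reads. The function name and the read counts below are made up for illustration.

```python
def relative_abundance(taxon_reads, total_reads):
    """Return each taxon's share of the sample's total reads (0 to 1)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {taxon: reads / total_reads for taxon, reads in taxon_reads.items()}


# Hypothetical sample; the counts are not real data.
sample = {"Trichodesmium": 150, "Noctiluca scintillans": 50}
shares = relative_abundance(sample, total_reads=20000)
```

If the table is rarefied to a fixed depth, the total is the same for every sample and the division is by that constant.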

Happy to chat more if this isn't clear, not always sure of your terminology.

Thank you

clairedavies commented 1 year ago

Molecular taxonomy string from Jodie to help find the named taxa

SILVA v138
- Bacteria; Cyanobacteria; Cyanobacteriia; Cyanobacteriales; Phormidiaceae; Trichodesmium

SILVA v138
- Eukaryota; SAR; Alveolata; Dinoflagellata; Noctilucales; Noctiluca
- Eukaryota; SAR; Stramenopiles; Ochrophyta; Raphidophyceae; Chattonellales; Chattonella
- Eukaryota; SAR; Stramenopiles; Ochrophyta; Raphidophyceae; Chattonellales; Heterosigma
- Eukaryota; SAR; Stramenopiles; Ochrophyta; Dictyochophyceae; Florenciellales; Pseudochattonella

PR2
- Eukaryota; TSAR; Alveolata; Dinoflagellata; Noctilucophyceae; Noctilucales; Noctilucaceae; Noctiluca OR Noctilucales_X
- Eukaryota; TSAR; Stramenopiles; Gyrista; Raphidophyceae; Raphidophyceae_X; Raphidophyceae_XX; Chattonella
- Eukaryota; TSAR; Stramenopiles; Gyrista; Raphidophyceae; Raphidophyceae_X; Raphidophyceae_XX; Heterosigma
- Eukaryota; TSAR; Stramenopiles; Gyrista; Dictyochophyceae; Dictyochophyceae_X; Florenciellales; Pseudochattonella

Smithmania commented 1 year ago

A test file at data/NRS_taxon_abundance.csv has been generated in branch 2.1.0. It was generated using the modified script SSI_1c_make_atlas_input.py and a new file, NRS_taxon_list.csv; both are included in the same branch.

Input data includes:

Taxonomy table was based on Silva138 taxonomy classified by QIIME2 SKlearn.
Abundance table used was based on ASV sequences subsampled to 20000 reads.
NRS sampleIDs and selected metadata were extracted from the AM database using the search parameters:
- searchfield = 'imos_site_code'
- searchTerm = 'NRS%' (wildcard search; all NRS sites have the format NRS***)
- returnfields = ['sample_id','depth', 'nrs_trip_code', 'nrs_sample_code','sample_integrity_warnings']

Samples containing sample integrity warnings were removed from the analysis.
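The selection described above can be sketched roughly as follows. This is not the actual AM database query: it assumes the returned records are plain dictionaries keyed by the field names listed above, and that an empty or missing `sample_integrity_warnings` value means no warning. `keep_sample` and the example records are hypothetical.

```python
def keep_sample(record):
    """True for NRS samples that carry no sample integrity warning."""
    # 'NRS%' as a wildcard pattern matches any site code starting with "NRS".
    site_code = str(record.get("imos_site_code") or "")
    is_nrs = site_code.startswith("NRS")
    # Assumption: a blank or missing warning field means the sample is clean.
    warning = record.get("sample_integrity_warnings")
    has_warning = bool(warning and str(warning).strip())
    return is_nrs and not has_warning


# Made-up records, for illustration only.
records = [
    {"sample_id": "A1", "imos_site_code": "NRSMAI", "sample_integrity_warnings": ""},
    {"sample_id": "A2", "imos_site_code": "NRSMAI", "sample_integrity_warnings": "low volume"},
    {"sample_id": "A3", "imos_site_code": "OTHER", "sample_integrity_warnings": None},
]
kept_ids = [r["sample_id"] for r in records if keep_sample(r)]
```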

The file NRS_taxon_list.csv holds the list of taxa to be retrieved, one taxon per line, using Silva138 taxonomy. The taxonomic levels are comma separated, and each carries a taxonomic level prefix (e.g., d__<name>,p__<name>,c__<name>,o__<name>,f__<name>,g__<name>,s__<name>) as per current AM portal formatting (https://data.bioplatforms.com/bpa/otu/). SSI_1c_make_atlas_input.py reads this file and includes every taxon listed in it in the analysis, so adding a taxon of interest only requires adding a line. The script should recognise and select the appropriate amplicon for Bacteria, Eukaryota and Archaea.

Output file column format is: sample_id,depth,nrs_trip_code,nrs_sample_code,amplicon,g__Chattonella_abundance_20K,g__Heterosigma_abundance_20K,g__Pseudochattonella_abundance_20K,g__Trichodesmium_abundance_20K,s__Noctiluca_scintillans_abundance_20K
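To illustrate how a line of NRS_taxon_list.csv relates to an output column name, here is a minimal sketch. It is not the code in SSI_1c_make_atlas_input.py; `parse_taxon_line` and `abundance_column` are hypothetical helpers, and the naming rule (lowest listed level plus `_abundance_20K`) is inferred from the column names above.

```python
RANK_PREFIXES = ("d__", "p__", "c__", "o__", "f__", "g__", "s__")


def parse_taxon_line(line):
    """Split one comma-separated taxon line and check the rank prefixes."""
    levels = [part.strip() for part in line.strip().split(",") if part.strip()]
    if not levels or len(levels) > len(RANK_PREFIXES):
        raise ValueError(f"expected 1 to 7 taxonomic levels, got {len(levels)}")
    for level, prefix in zip(levels, RANK_PREFIXES):
        if not level.startswith(prefix):
            raise ValueError(f"{level!r} should start with {prefix!r}")
    return levels


def abundance_column(levels, suffix="_abundance_20K"):
    """Name the output column after the lowest taxonomic level given."""
    return levels[-1].replace(" ", "_") + suffix


line = ("d__Bacteria,p__Cyanobacteria,c__Cyanobacteriia,"
        "o__Cyanobacteriales,f__Phormidiaceae,g__Trichodesmium")
levels = parse_taxon_line(line)
column = abundance_column(levels)
```

A line that stops at genus gives a `g__` column; a line that continues to species gives an `s__` column.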

Feedback welcomed

clairedavies commented 1 year ago

Hi, thanks again for doing this. I just had a quick look at the data. It looks fine, but could I possibly request that the Trichodesmium data be at the species level instead of the genus level?

So ideally there would be a column for T. erythraeum, T. thiebautii and a T. spp. for those that aren’t given a species name. There may be other species as well but I would definitely expect these two to be there.
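A rough sketch of that split, assuming each ASV row carries a genus, an optional species name and a read count. The row layout, the function name and the counts are hypothetical; ASVs without a species name are pooled under "spp.".

```python
from collections import Counter


def species_totals(asv_rows, genus="Trichodesmium"):
    """Sum reads per species within a genus; unnamed species go to '<genus> spp.'."""
    totals = Counter()
    for row in asv_rows:
        if row["genus"] != genus:
            continue
        label = row.get("species") or f"{genus} spp."
        totals[label] += row["reads"]
    return dict(totals)


# Made-up ASV rows, for illustration only.
rows = [
    {"genus": "Trichodesmium", "species": "Trichodesmium erythraeum", "reads": 120},
    {"genus": "Trichodesmium", "species": "Trichodesmium thiebautii", "reads": 30},
    {"genus": "Trichodesmium", "species": None, "reads": 10},
    {"genus": "Trichodesmium", "species": "Trichodesmium erythraeum", "reads": 5},
    {"genus": "Noctiluca", "species": "Noctiluca scintillans", "reads": 40},
]
totals = species_totals(rows)
```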

We are interested in the distributions of the individual species. This is something that we can’t determine with light microscopy, so it is a real advantage of this data.

Thanks in advance

clairedavies commented 1 year ago

Please could you:
1) expand this dataset to include all stations (this is especially important for looking at Tricho in the GBR)
2) modify the data so that the abundance table is based on ASV sequences for ALL reads; this way I can get a better estimate of the proportional representation of the taxa in the sample

Down the track, probably just for the Tricho project, I would be looking for:
1) A data table of all the ASVs within the genus Trichodesmium. From this I can work out the proportion of nifH genes that are Tricho, and we may then want to look at those that aren't ......

Thanks again. Your efforts are appreciated. Jodie is across the details if you need some clarification. I will be on leave until 20th June so no rush.

Smithmania commented 11 months ago

The ASV-based table is presented in branch 2.1.0_ASV. I included total abundance and unique ASV counts for all ASVs and for ASVs subsampled to 20K reads. This is extended to the trait numbers too.

Some samples do not have ASV/taxonomy info. This happens when a trait is present but none of the selected taxa is found in the sample. These can be excluded, but I thought they might be interesting info down the track.

I have given Jodie the rundown of the sheet so she should be able to answer any questions.

clairedavies commented 10 months ago

Thanks for the ASV table. Please can you add all stations to the NRS_taxon_abundance.csv table? At the moment you have searchTerm = 'NRS%'; please can you drop this filter. Thanks

clairedavies commented 10 months ago

Oops, I had a better look at the ASV table and can see what you have done now. I'll have a play with this; it looks like it has what we want in it.

clairedavies commented 1 month ago

Please can you add Trichosphaerium (genus) to the ASV table. Jodie will work out the labelling in your db based on the taxonomy.

AusMicrobiome / microbial_ocean_atlas

Species distribution dataset #9