Detect and identify missing metagenomes

Smithmania commented 1 year ago

Can we add a feature that prints text like the text you use for files that are not available (e.g., " not available") for metagenomes that are missing from the MGSD tags? This would probably be achieved by making a list of all IDs tagged as metagenomes, comparing it to the list polled from the MGSD tags, and finding the difference.

This feature will be beneficial for us and end users to track the progress of MG analysis.
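
A minimal sketch of that comparison, assuming both lists are already available as plain Python sequences of sample IDs. The function names and the example IDs are hypothetical and are not taken from bpaotu.

```python
def find_missing_metagenomes(tagged_ids, mgsd_ids):
    """Return the IDs tagged as metagenomes that have no MGSD entry, sorted."""
    return sorted(set(tagged_ids) - set(mgsd_ids))


def format_missing(missing_ids):
    """Build one status line per missing sample."""
    return [f"{sample_id} not available" for sample_id in missing_ids]


if __name__ == "__main__":
    # Made-up IDs, for illustration only.
    tagged = ["sample-1", "sample-2", "sample-3"]
    in_mgsd = ["sample-2"]
    for line in format_missing(find_missing_metagenomes(tagged, in_mgsd)):
        print(line)
```

Using sets keeps the comparison independent of the order in which either list was polled.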

hou098 commented 1 year ago

@Smithmania

This would probably be achieved by making a list of all IDs tagged as metagenomes

How are samples tagged as having metagenome data? Do you mean those samples that have an OTU with the special metaxa_from_metagnome amplicon, or something else? The only tag I'm aware of is the type:amdb-metagenomics-analysed tag on CKAN packages.

Smithmania commented 1 year ago

There should be a query that the metagenomics tag on the CKAN web portal uses. If you look on the left-hand side of the CKAN portal, under Tags, you will see the metagenomics tag; it currently shows 1893 samples. You might need to ask Mark.T for the exact query he uses for that.

hou098 commented 1 year ago

@Smithmania: This might almost be a one-liner using the CKAN Python API.

Something along the lines of…

package_search(q='type:(not amdb-metagenomics-analysed)', fq='tags:metagenomics')

… might work. I will investigate.

hou098 commented 1 year ago

This seems to work

len(ckan_remote_object.action.package_search(
    q='(tags:metagenomics) AND NOT (type:amdb-metagenomics-analysed)',
    rows=3000)['results'])

1893 results
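
If a portal limits how many rows a single request may return, the same query can be paged with the `start` and `rows` parameters of `package_search`. This is a sketch only: it assumes `ckan` is any object exposing `action.package_search` the way ckanapi's `RemoteCKAN` does, and `search_all_packages` and `page_size` are names introduced here for illustration.

```python
def search_all_packages(ckan, query, page_size=1000):
    """Collect every package matching `query`, one page at a time."""
    results = []
    while True:
        page = ckan.action.package_search(
            q=query, start=len(results), rows=page_size
        )
        batch = page["results"]
        results.extend(batch)
        # Stop when the server has nothing more to give or when the
        # reported total has been reached.
        if not batch or len(results) >= page["count"]:
            return results


# Possible usage (requires network access and the ckanapi package):
#   from ckanapi import RemoteCKAN
#   ckan = RemoteCKAN("https://<portal-host>")  # host is a placeholder
#   query = "(tags:metagenomics) AND NOT (type:amdb-metagenomics-analysed)"
#   print(len(search_all_packages(ckan, query)))
```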

hou098 commented 1 year ago

Clarifying the data model for the production of these CKAN datasets.

Datasets with tags:metagenomics and not of type:amdb-metagenomics-analysed are processed to generate new datasets that have tags:metagenomics and are of type:amdb-metagenomics-analysed. Only "input" datasets for this process that have res_format:FASTQ are eligible for processing.
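
The rule above can be written as a pair of predicates over CKAN package dictionaries. This is a sketch under the assumption that each package carries the usual `type`, `tags` (a list of dicts with a `name`) and `resources` (a list of dicts with a `format`) keys; the helper names are hypothetical.

```python
ANALYSED_TYPE = "amdb-metagenomics-analysed"


def has_tag(package, name):
    """True if the package carries a tag with this name."""
    return any(tag.get("name") == name for tag in package.get("tags", []))


def has_resource_format(package, fmt):
    """True if any resource has this format (case-insensitive)."""
    wanted = fmt.upper()
    return any(
        (res.get("format") or "").upper() == wanted
        for res in package.get("resources", [])
    )


def is_eligible_input(package):
    """An "input" dataset: tagged metagenomics, not yet analysed, has FASTQ."""
    return (
        has_tag(package, "metagenomics")
        and package.get("type") != ANALYSED_TYPE
        and has_resource_format(package, "FASTQ")
    )


def is_analysed_output(package):
    """An "output" dataset: tagged metagenomics and of the analysed type."""
    return has_tag(package, "metagenomics") and package.get("type") == ANALYSED_TYPE
```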

Smithmania commented 1 year ago

Sounds about right as far as the current tags/types are named.

I found the 2 sample IDs that don’t have FASTQ data, so we should be able to check your logic.

hou098 commented 1 year ago

OK, how about a "special" URL, say, /metagenome/status to report this? There's already a precedent for this kind of thing as we have the undocumented /ingest/ URL for the ingest report.

Smithmania commented 1 year ago

Can we set up a meeting tomorrow morning to discuss? Does 10 am work for you?

hou098 commented 1 year ago

OK. 10:00AM Tuesday.

Smithmania commented 1 year ago

Modify metagenome download functionality to mimic the non-denoised amplicon behaviour. In this case MGSD files requested by a user for selected amplicon(s) will be emailed via the bpa help desk to the analytics team. The analytics team will manually distribute the requested files in consultation with the user.

As no MGSD data will be housed on CKAN, available metagenome samples should be identified by searching for the tag metagenomics and the file type FASTQ. Instead of providing a list of available files for each individual sample, the popup selector should contain the list of all file types that the user can select (as in the "Download zip archive of selected metagenome files for selected samples" button). The analytics team will determine whether those files are available for the sample and consult with the individual requesting the data. It would still be good if samples not meeting metadata requirements were excluded; however, this can be done by the analytics team when retrieving the data. Interactive sample searches using metadata (e.g. lat, long, vegetation type, environment etc.) and map-based searches would be good, rather than the plain non-denoised sample request.

A list of available files for each sample ID will be prepared as part of the data analysis workflow and provided in the bpa-otu ingest packet. This list will be used to populate the file availability popup instead of a CKAN query. Samples excluded due to not meeting metadata requirements should either be removed, as for the amplicons, or better still flagged “unavailable due to non compliant metadata”.
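
One way the popup text could be derived from such a list is sketched below. The format of the ingest packet is not specified in this thread, so the sketch assumes a plain mapping from sample ID to file types plus a set of excluded sample IDs; every name in it is hypothetical.

```python
def availability_message(sample_id, available_files, noncompliant_ids):
    """Describe what the popup could show for a single sample.

    available_files:  dict mapping sample ID to a list of file types
                      (assumed shape of the list in the ingest packet).
    noncompliant_ids: set of sample IDs excluded for metadata reasons.
    """
    if sample_id in noncompliant_ids:
        return f"{sample_id}: unavailable due to non compliant metadata"
    files = available_files.get(sample_id, [])
    if not files:
        return f"{sample_id}: not available"
    return f"{sample_id}: " + ", ".join(sorted(files))


if __name__ == "__main__":
    # Made-up sample IDs and file types, for illustration only.
    files_by_sample = {"sample-1": ["type-b", "type-a"], "sample-2": []}
    excluded = {"sample-3"}
    for sid in ("sample-1", "sample-2", "sample-3"):
        print(availability_message(sid, files_by_sample, excluded))
```

Flagging rather than dropping excluded samples keeps them visible to the user, which matches the "better still" option described above.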

We will need to re-think how the MG search is done.

Perhaps an amplicon-style selector to switch between metaxa and MGSD data products. Currently, searching by metaxa data may omit some samples in the odd case where no results are returned from the metaxa analysis for that sample.
Perhaps when the search button is pressed without any taxonomy selected, the search results retrieved will be from the CKAN metagenomics/FASTQ search.

hou098 commented 1 year ago

Modify metagenome download functionality to mimic the non-denoised amplicon behaviour. In this case MGSD files requested by a user for selected amplicon(s) will be emailed via the bpa help desk to the analytics team. The analytics team will manually distribute the requested files in consultation with the user.

Just to be clear: you still want to be able to restrict by sample context search (e.g. lat, long, vegetation type, environment etc.), yes? i.e. you don't want something as plain as the non-denoised search where all you get to filter by is sample id.

We will need to re-think how the MG search is done. Perhaps an Amplicon style selector to switch between metaxa and MGSD data products. Currently searching by metaxa data may omit some samples in the odd case where no results are returned from the metaxa analysis for that sample

One possibility is to completely wildcard the amplicon part of the search. This should be possible without any dramatic ill effects. The only downside I can think of right now is that there will be more taxonomy selection options at every rank, as the choices will be built from all available values for that rank. (e.g. the Kingdom dropdown would include k_fungi as well as d_Archaea, and every other option ever available in the kingdom dropdown).
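
The effect on the dropdowns can be illustrated with a small sketch. It assumes taxonomy entries are available as dictionaries with an `amplicon` key and one key per rank; this is not bpaotu's actual data model, and the amplicon names are placeholders.

```python
def rank_options(taxonomy_rows, rank, amplicon=None):
    """Distinct values for `rank`; amplicon=None acts as the wildcard."""
    return sorted(
        {
            row[rank]
            for row in taxonomy_rows
            if amplicon is None or row["amplicon"] == amplicon
        }
    )


if __name__ == "__main__":
    rows = [
        {"amplicon": "amplicon-A", "kingdom": "d_Archaea"},
        {"amplicon": "amplicon-B", "kingdom": "k_fungi"},
    ]
    print(rank_options(rows, "kingdom", "amplicon-A"))  # ['d_Archaea']
    print(rank_options(rows, "kingdom"))  # ['d_Archaea', 'k_fungi']
```

With the wildcard, the kingdom list is the union across all amplicons, which is the "more options at every rank" downside mentioned above.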

A further possibility is to remove the taxonomy dropdowns altogether in the metagenome search page.

Also, can you confirm that there's no metagenome data for the following samples? These are samples that have no associated otu or taxonomy info at all.

webapp=# select otu.sample_context.id, sample_site_location_description
           from otu.sample_context
           left outer join otu.sample_otu on otu.sample_context.id = otu.sample_otu.sample_id
           left join otu.otu on otu.sample_otu.otu_id = otu.otu.id
           left join otu.taxonomy_otu on otu.otu.id = otu.taxonomy_otu.otu_id
          where otu.sample_otu.otu_id is null;
   id   |                     sample_site_location_description                     
--------+--------------------------------------------------------------------------
 137929 | Mingenew
 7046   | Lake Lewis
 137799 | Kerang
 19572  | WCP12 (2003CN) - informal reserve (research) in production native forest
 138686 | Towra Point
 13554  | Antarctic
 34937  | inshore reef_Channel
 19571  | WCP12 (2003CN) - informal reserve (research) in production native forest
 7074   | Lake Way
 13566  | Antarctic
 141301 | Rottnest Island
 7072   | Mibbeyean Creek
 137853 | Clare
 34949  | inshore reef_Channel
 8290   | Rutherglen
 7073   | Lake Way
 13285  | King Island
 13734  | Credo Redgum Plot
 137923 | Tammin
(19 rows)

webapp=#
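
The query relies on the anti-join pattern: a left join keeps every sample, and the `is null` filter keeps only those with no matching row. The self-contained sketch below reproduces the pattern against an in-memory SQLite database with a simplified, made-up two-table schema; it is not the bpaotu schema and does not touch the real database.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with made-up samples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sample_context (
            id TEXT PRIMARY KEY,
            sample_site_location_description TEXT
        );
        CREATE TABLE sample_otu (sample_id TEXT, otu_id INTEGER);
        INSERT INTO sample_context VALUES
            ('s1', 'Site one'), ('s2', 'Site two'), ('s3', 'Site three');
        INSERT INTO sample_otu VALUES ('s1', 10), ('s1', 11), ('s3', 12);
        """
    )
    return conn


def samples_without_otus(conn):
    """Anti-join: samples that have no row at all in sample_otu."""
    return conn.execute(
        """
        SELECT sc.id, sc.sample_site_location_description
          FROM sample_context AS sc
          LEFT JOIN sample_otu AS so ON sc.id = so.sample_id
         WHERE so.otu_id IS NULL
         ORDER BY sc.id
        """
    ).fetchall()


if __name__ == "__main__":
    print(samples_without_otus(build_demo_db()))  # [('s2', 'Site two')]
```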

Smithmania commented 1 year ago

I don't see any datasets returned on CKAN for the above samples (searching with sample_id:102.100.100.<sample_id>), except for 34949. That sample was on a 16S plate (AUWLK), and judging by the number of returned reads it looks like it failed sequencing. It does look like we have metadata in our DB for all of the samples, and at a quick glance it meets minimal standards, so it's likely those samples completely failed sequencing (no FASTQ generated).

hou098 commented 1 year ago

@Smithmania What do you think about this:

One possibility is to completely wildcard the amplicon part of the search. This should be possible without any dramatic ill effects. The only downside I can think of right now is that there will be more taxonomy selection options at every rank, as the choices will be built from all available values for that rank. (e.g. the Kingdom dropdown would include k_fungi as well as d_Archaea, and every other option ever available in the kingdom dropdown).

hou098 commented 1 year ago

Modify metagenome download functionality to mimic the non-denoised amplicon behaviour. In this case MGSD files requested by a user for selected amplicon(s) will be emailed via the bpa help desk to the analytics team. The analytics team will manually distribute the requested files in consultation with the user.

Implemented in https://github.com/BioplatformsAustralia/bpaotu/tree/1.36.0 (see https://github.com/BioplatformsAustralia/bpaotu/commit/1f1647a725578c920935b952eb58e8ea5e82c88e )

In metagenome mode, the amplicon selector can be set to '--', which selects every sample tagged as having metagenome data, regardless of taxonomy.

BioplatformsAustralia / bpaotu

Detect and identify missing metagenomes #245