Closed Smithmania closed 1 year ago
@Smithmania
This would probably be achieved by making a list of all IDs tagged as metagenomes
How are samples tagged as having metagenome data? Do you mean those samples that have an OTU with the special metaxa_from_metagnome amplicon, or something else? The only tag I'm aware of is the type:amdb-metagenomics-analysed tag on CKAN packages.
There should be a query that the metagenomics tag on the CKAN web portal uses. If you look on the CKAN portal, on the left-hand side under Tags, you will see the metagenomics tag; it's currently showing 1893 samples. You might need to ask Mark.T for the exact query he uses for that.
@Smithmania: This might almost be a one-liner using the CKAN python API.
Something along the lines of…
package_search(q='type:(not amdb-metagenomics-analysed)', fq='tags:metagenomics')
… might work. I will investigate.
This seems to work
len(ckan_remote_object.action.package_search(
    q='(tags:metagenomics) AND NOT (type:amdb-metagenomics-analysed)',
    rows=3000)['results'])
1893 results
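A CKAN site may cap the number of rows returned by a single package_search call, so a paged version of the query above can be safer than one large rows value. This is only a sketch: iter_packages and PENDING_QUERY are names made up here, and the search callable is passed in so the paging logic does not depend on any particular client.

```python
from typing import Any, Callable, Dict, Iterator

# Query string taken from the comment above.
PENDING_QUERY = "(tags:metagenomics) AND NOT (type:amdb-metagenomics-analysed)"


def iter_packages(
    package_search: Callable[..., Dict[str, Any]],
    query: str,
    page_size: int = 500,
) -> Iterator[Dict[str, Any]]:
    """Yield every package matching `query`, one page at a time.

    `package_search` is any callable that accepts the q, rows and start
    keywords of CKAN's package_search action and returns a dict with
    'count' and 'results' keys.
    """
    start = 0
    while True:
        page = package_search(q=query, rows=page_size, start=start)
        results = page["results"]
        if not results:
            break
        yield from results
        # Advance by what was actually returned, in case the server
        # returned fewer rows than requested.
        start += len(results)
        if start >= page["count"]:
            break
```

Assuming ckan_remote_object is a ckanapi RemoteCKAN instance, something like sum(1 for _ in iter_packages(ckan_remote_object.action.package_search, PENDING_QUERY)) should give the same count as the one-liner.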
Clarifying the data model for the production of these CKAN datasets.
Datasets with tags:metagenomics and not of type:amdb-metagenomics-analysed are processed to generate new datasets that have tags:metagenomics and are of type:amdb-metagenomics-analysed. Only "input" datasets for this process that have res_format:FASTQ are eligible for processing.
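To make that data model concrete, here is a rough sketch of how a single CKAN package dict could be classified. It assumes the usual package layout (a 'type' string, 'tags' as a list of dicts with a 'name', 'resources' as a list of dicts with a 'format'); the function names and the returned labels are invented for illustration and are not part of bpaotu.

```python
from typing import Any, Dict

ANALYSED_TYPE = "amdb-metagenomics-analysed"
METAGENOMICS_TAG = "metagenomics"


def has_tag(package: Dict[str, Any], tag: str) -> bool:
    """True if the package carries the given tag name."""
    return any(t.get("name") == tag for t in package.get("tags", []))


def has_fastq(package: Dict[str, Any]) -> bool:
    """True if at least one resource has format FASTQ (case-insensitive)."""
    return any(
        (r.get("format") or "").upper() == "FASTQ"
        for r in package.get("resources", [])
    )


def classify(package: Dict[str, Any]) -> str:
    """Return a label describing the package's role in the workflow.

    The labels are hypothetical names used only in this sketch.
    """
    if not has_tag(package, METAGENOMICS_TAG):
        return "not-metagenomics"
    if package.get("type") == ANALYSED_TYPE:
        return "analysed"
    return "eligible-input" if has_fastq(package) else "ineligible-input"
```

Only packages labelled "eligible-input" would be candidates for processing under the description above.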
Sounds about right as far as the current tags/types are named
I found the 2 sampleIDs that don’t have fastq data so we should be able to check your logic
OK, how about a "special" URL, say, /metagenome/status, to report this? There's already a precedent for this kind of thing as we have the undocumented /ingest/ URL for the ingest report.
can we set up a meeting tomorrow morning to discuss - does 10 am work for you?
OK. 10:00AM Tuesday.
Modify metagenome download functionality to mimic the non-denoised amplicon behaviour. In this case MGSD files requested by a user for selected amplicon(s) will be emailed via the bpa help desk to the analytics team. The analytics team will manually distribute the requested files in consultation with the user.
As no MGSD data will be housed on CKAN, available metagenome samples should be identified by searching for the tag metagenomics and the file type FASTQ. Instead of providing a list of available files for each individual sample, the popup selector should contain the list of all file types that the user can select (as in the "Download zip archive of selected metagenome files for selected samples" button). The analytics team will determine whether those files are available for the sample and consult with the individual requesting the data. It would still be good if samples not meeting metadata requirements were excluded; however, this can be done by the analytics team when retrieving the data.
Interactive sample searches using metadata (e.g. lat, long, vegetation type, environment etc.) and map-based searches would be good, rather than the plain non-denoised sample request.
A list of available files for each sampleID will be prepared as part of the data analysis workflow and provided in the bpa-otu ingest packet. This list will be used to populate the file availability popup instead of a CKAN query. Samples excluded due to not meeting metadata requirements should either be removed, as per the amplicons, or better still flagged “unavailable due to non-compliant metadata”.
We will need to re-think how the MG search is done.
Modify metagenome download functionality to mimic the non-denoised amplicon behaviour. In this case MGSD files requested by a user for selected amplicon(s) will be emailed via the bpa help desk to the analytics team. The analytics team will manually distribute the requested files in consultation with the user.
Just to be clear: you still want to be able to restrict by sample context search (e.g. lat, long, vegetation type, environment etc.), yes? i.e. you don't want something as plain as the non-denoised search where all you get to filter by is sample id.
We will need to re-think how the MG search is done. Perhaps an Amplicon-style selector to switch between metaxa and MGSD data products. Currently, searching by metaxa data may omit some samples in the odd case where no results are returned from the metaxa analysis for that sample.
One possibility is to completely wildcard the amplicon part of the search. This should be possible without any dramatic ill effects. The only downside I can think of right now is that there will be more taxonomy selection options at every rank, as the choices will be built from all available values for that rank. (e.g. the Kingdom dropdown would include k_fungi as well as d_Archaea, and every other option ever available in the kingdom dropdown).
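As a toy illustration of that side effect (not the actual bpaotu query code), the sketch below builds the Kingdom options from a handful of in-memory rows. TaxonomyRow, kingdom_options and the amplicon names in the usage note are made up; passing None stands in for the wildcard.

```python
from typing import Iterable, List, NamedTuple, Optional


class TaxonomyRow(NamedTuple):
    """Hypothetical, simplified stand-in for one taxonomy record."""
    amplicon: str
    kingdom: str


def kingdom_options(rows: Iterable[TaxonomyRow], amplicon: Optional[str]) -> List[str]:
    """Distinct kingdom values for the dropdown.

    amplicon=None acts as the wildcard: no amplicon filter is applied,
    so values from every amplicon end up in the list.
    """
    return sorted(
        {r.kingdom for r in rows if amplicon is None or r.amplicon == amplicon}
    )
```

With rows such as TaxonomyRow("ITS", "k_fungi") and TaxonomyRow("16S", "d_Archaea"), filtering on "16S" offers only d_Archaea, while the wildcard offers both values in the same dropdown.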
A further possibility is to remove the taxonomy dropdowns altogether in the metagenome search page.
Also, can you confirm that there's no metagenome data for the following samples? These are samples that have no associated otu or taxonomy info at all.
webapp=# select otu.sample_context.id, sample_site_location_description from otu.sample_context left outer join otu.sample_otu on otu.sample_context.id = otu.sample_otu.sample_id left join otu.otu on otu.sample_otu.otu_id = otu.otu.id left join otu.taxonomy_otu on otu.otu.id = otu.taxonomy_otu.otu_id where otu.sample_otu.otu_id is null;
id | sample_site_location_description
--------+--------------------------------------------------------------------------
137929 | Mingenew
7046 | Lake Lewis
137799 | Kerang
19572 | WCP12 (2003CN) - informal reserve (research) in production native forest
138686 | Towra Point
13554 | Antarctic
34937 | inshore reef_Channel
19571 | WCP12 (2003CN) - informal reserve (research) in production native forest
7074 | Lake Way
13566 | Antarctic
141301 | Rottnest Island
7072 | Mibbeyean Creek
137853 | Clare
34949 | inshore reef_Channel
8290 | Rutherglen
7073 | Lake Way
13285 | King Island
13734 | Credo Redgum Plot
137923 | Tammin
(19 rows)
webapp=#
I don't see any datasets returned on CKAN (using the search sample_id:102.100.100.<sample_id>) for the above samples except for 34949. This sample was on a 16S (plate AUWLK), and it looks like it failed sequencing, judging by the number of returned reads. It does look like we have metadata in our DB for all samples, and at a quick glance it looks like it meets minimal standards, so it's likely those samples completely failed sequencing (no FASTQ generated).
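The same per-sample check could be scripted instead of being run by hand in the portal. The sketch below assumes the sample_id search field and the 102.100.100 prefix quoted above; sample_query and samples_without_datasets are hypothetical helper names, and the search callable is injected so any CKAN client can be used.

```python
from typing import Any, Callable, Dict, Iterable, List

SAMPLE_ID_PREFIX = "102.100.100"  # prefix as used in the search quoted above


def sample_query(sample_id: int) -> str:
    """Build the CKAN search string for one sample."""
    return f"sample_id:{SAMPLE_ID_PREFIX}.{sample_id}"


def samples_without_datasets(
    package_search: Callable[..., Dict[str, Any]],
    sample_ids: Iterable[int],
) -> List[int]:
    """Return the sample ids for which the search finds no datasets.

    rows=0 asks only for the match count, not the packages themselves.
    """
    missing = []
    for sample_id in sample_ids:
        result = package_search(q=sample_query(sample_id), rows=0)
        if result["count"] == 0:
            missing.append(sample_id)
    return missing
```

With a ckanapi RemoteCKAN object, its action.package_search attribute should be usable as the package_search argument.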
@Smithmania What do you think about this:
One possibility is to completely wildcard the amplicon part of the search. This should be possible without any dramatic ill effects. The only downside I can think of right now is that there will be more taxonomy selection options at every rank, as the choices will be built from all available values for that rank. (e.g. the Kingdom dropdown would include k_fungi as well as d_Archaea, and every other option ever available in the kingdom dropdown).
Modify metagenome download functionality to mimic the non-denoised amplicon behaviour. In this case MGSD files requested by a user for selected amplicon(s) will be emailed via the bpa help desk to the analytics team. The analytics team will manually distribute the requested files in consultation with the user.
Implemented in https://github.com/BioplatformsAustralia/bpaotu/tree/1.36.0 (see https://github.com/BioplatformsAustralia/bpaotu/commit/1f1647a725578c920935b952eb58e8ea5e82c88e )
In metagenome mode, the amplicon selector can be set to '--', which selects every sample tagged as having metagenome data, regardless of taxonomy.
Can we add a feature that prints text like you use for files that are not available (e.g., " not available") for metagenomes that are missing from the MGSD tags? This would probably be achieved by making a list of all IDs tagged as metagenomes and comparing to the list polled from MGSD tags and finding the difference.
This feature will be beneficial for us and end users to track the progress of MG analysis.
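The comparison described above amounts to a set difference. A minimal sketch, assuming the two ID lists have already been collected (for example from CKAN searches); the function names, the dictionary keys and the exact "not available" wording are placeholders rather than anything taken from bpaotu:

```python
from typing import Dict, Iterable, List

NOT_AVAILABLE = "not available"  # label text is an assumption


def metagenome_progress(
    tagged_ids: Iterable[str],
    analysed_ids: Iterable[str],
) -> Dict[str, List[str]]:
    """Compare IDs tagged as metagenomes with IDs that have MGSD results."""
    tagged = set(tagged_ids)
    analysed = set(analysed_ids)
    return {
        "analysed": sorted(tagged & analysed),
        "pending": sorted(tagged - analysed),     # tagged but not yet analysed
        "unexpected": sorted(analysed - tagged),  # analysed but never tagged
    }


def status_lines(progress: Dict[str, List[str]]) -> List[str]:
    """Format one line per pending sample, followed by a summary line."""
    lines = [f"{sample_id}: {NOT_AVAILABLE}" for sample_id in progress["pending"]]
    done = len(progress["analysed"])
    total = done + len(progress["pending"])
    lines.append(f"{done} of {total} metagenome samples analysed")
    return lines
```

The "unexpected" bucket is included only as a sanity check; it should stay empty if every analysed dataset was produced from a tagged input.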