Efetch module refactor - Githubissues

misialq commented 3 years ago

Disclaimer: this is a nasty but necessary PR.

This PR introduces a more-or-less complete refactor of the _efetch.py submodule. Rather than extracting metadata from the NCBI response in an unstructured fashion, we can make use of the hierarchical structure of NCBI's SRA objects: study/sample/experiment/run. This will not only make the code easier to read but will also simplify handling of the different metadata levels. It also prepares fondue for the next PR on fetching by project ID (#13).

Here's a high-level summary of the changes:

introduced a new _sra_meta.py submodule containing data classes that store the metadata of a specific SRA object and reflect how those objects are related to one another; it also includes functions for retrieving metadata from those objects and their children, to make creation of the final metadata DataFrame simpler
refactored the _efetch.py submodule to make use of the new objects described above
improved handling of duplicated custom attributes
changed the GHA to at least run the tests on every push
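
To illustrate the idea behind the hierarchy, here is a minimal sketch and not the actual contents of _sra_meta.py. All class names, field names (`custom_meta`, `runs`, `experiments`, `samples`), the `get_metadata` method and the accession IDs are hypothetical, chosen only to show how each level can collect rows from its children and add its own fields:

```python
from dataclasses import dataclass, field
from typing import Dict, List

Row = Dict[str, str]  # one flat metadata record per run


@dataclass
class Run:
    id: str
    custom_meta: Row = field(default_factory=dict)

    def get_metadata(self) -> List[Row]:
        # A run is the lowest level, so it yields exactly one row.
        return [{"run_id": self.id, **self.custom_meta}]


@dataclass
class Experiment:
    id: str
    runs: List[Run] = field(default_factory=list)
    custom_meta: Row = field(default_factory=dict)

    def get_metadata(self) -> List[Row]:
        # Experiment-level fields are repeated on every child run row.
        own = {"experiment_id": self.id, **self.custom_meta}
        return [{**own, **row} for run in self.runs for row in run.get_metadata()]


@dataclass
class Sample:
    id: str
    experiments: List[Experiment] = field(default_factory=list)
    custom_meta: Row = field(default_factory=dict)

    def get_metadata(self) -> List[Row]:
        own = {"sample_id": self.id, **self.custom_meta}
        return [{**own, **row} for exp in self.experiments for row in exp.get_metadata()]


@dataclass
class Study:
    id: str
    samples: List[Sample] = field(default_factory=list)
    custom_meta: Row = field(default_factory=dict)

    def get_metadata(self) -> List[Row]:
        own = {"study_id": self.id, **self.custom_meta}
        return [{**own, **row} for smp in self.samples for row in smp.get_metadata()]


# Made-up accession IDs, purely for demonstration.
study = Study(
    id="SRP000001",
    custom_meta={"title": "demo"},
    samples=[
        Sample(
            id="SRS000001",
            custom_meta={"host": "human"},
            experiments=[
                Experiment(
                    id="SRX000001",
                    custom_meta={"platform": "ILLUMINA"},
                    runs=[Run(id="SRR000001"), Run(id="SRR000002")],
                )
            ],
        )
    ],
)

rows = study.get_metadata()  # one dict per run, with all parent fields attached
```

A list of dicts like `rows` could then be passed to `pandas.DataFrame(rows)` to build the final table, with one row per run.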

Some issues that emerged while refactoring: #24, #25

This PR in no way influences fetching sequences.

A simplified diagram of the new workflow: new_metadata_flow

misialq commented 3 years ago

Done, review away! As discussed IRL, it's probably easier to review this as if it were an entirely new thing. Also, please try it out with some run IDs!

misialq commented 3 years ago

Hey @adamovanja, thanks for the review! The PR is now updated as per your comments, where applicable.

the GHA was moved to #30 (already merged) and this PR was updated accordingly
the "sample" issue will be "resolved" in #29
attached diagram: yeah, it needs an update (I generated it before writing most of the code so some things changed slightly) - feel free to open an issue to place that somewhere in the documentation

LenaFloerl commented 3 years ago

Hi - I tested several Run Acc. No. (also mixed from different BioProjects) and it works like a gem. Using Sample IDs gives an error message, as expected.

As discussed, it would be very useful to ultimately have the generated metadata file transformed, so that it can directly be used in a QIIME workflow. So far I was only able to use it after exporting it as a tsv.

misialq commented 3 years ago

> As discussed, it would be very useful to ultimately have the generated metadata file transformed, so that it can directly be used in a QIIME workflow. So far I was only able to use it after exporting it as a tsv.

That's awesome, thanks @LenaFloerl! Would you mind opening an issue describing in a bit more detail the exact behaviour you'd like? E.g., an example of an action where this metadata artifact would be used would be very helpful. Thank you!

nbokulich commented 3 years ago

> it would be very useful to ultimately have the generated metadata file transformed, so that it can directly be used in a QIIME workflow

It looks like the necessary transformer is already in place... is this not working?

https://github.com/bokulich-lab/q2-fondue/blob/main/q2_fondue/types/_transformer.py#L31-L35

LenaFloerl commented 3 years ago

> > it would be very useful to ultimately have the generated metadata file transformed, so that it can directly be used in a QIIME workflow

> It looks like the necessary transformer is already in place... is this not working?

> https://github.com/bokulich-lab/q2-fondue/blob/main/q2_fondue/types/_transformer.py#L31-L35

Yes! I installed fondue in an existing qiime2 env and downloaded again with get-metadata. Using the metadata.qza now works fine, thanks @misialq!

bokulich-lab / q2-fondue

Efetch module refactor #26