bokulich-lab / q2-fondue

Functions for reproducibly Obtaining and Normalizing Data re-Used from Elsewhere
BSD 3-Clause "New" or "Revised" License

ENH: Add scraping of other IDs #116

Closed adamovanja closed 2 years ago

adamovanja commented 2 years ago

closes #108

This PR adds scraping of study, sample and experiment IDs to the scrape-collection action. As sample and experiment IDs are not supported by the other q2-fondue actions, they are written to one common artifact, other-ids.qza.
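To illustrate the kind of matching involved, here is a minimal sketch of pulling accession IDs out of free text with regular expressions. It is not the code in q2_fondue/scraper.py: the names `PATTERNS` and `scrape_ids` are hypothetical, and the patterns only assume the usual INSDC accession shapes (a database letter S, E or D, followed by RR, RX, RS or RP and digits, plus PRJ-style BioProject IDs).

```python
import re
from typing import Dict, List

# Assumed accession shapes (hypothetical helper, not q2-fondue's scraper).
# First letter: S = NCBI SRA, E = EBI ENA, D = DDBJ.
PATTERNS = {
    "run": r"[SED]RR\d+",
    "experiment": r"[SED]RX\d+",
    "sample": r"[SED]RS\d+",
    "study": r"[SED]RP\d+",
    "bioproject": r"PRJ[EDN][A-Z]\d+",
}


def scrape_ids(text: str) -> Dict[str, List[str]]:
    """Return the unique, sorted IDs of each kind found in `text`."""
    found = {}
    for kind, pattern in PATTERNS.items():
        # Word boundaries keep us from matching inside longer tokens.
        matches = re.findall(rf"\b{pattern}\b", text)
        found[kind] = sorted(set(matches))
    return found


if __name__ == "__main__":
    example = (
        "Reads are under PRJNA123456 (study SRP098765); "
        "see experiment SRX111222 and run ERR333444."
    )
    print(scrape_ids(example))
```

In this sketch, the experiment and sample entries correspond to what the PR groups as "other" IDs, while study and BioProject IDs map to types the other actions already accept.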

Testing: Run the scrape-collection action on a Zotero collection of your own, or add any of the publications below to a collection and verify that the study and "other" IDs were scraped correctly:

adamovanja commented 2 years ago

Thank you both for the reviews.

@misialq:

  1. I believe we can view the NCBIAccessionIDs type as IDs that uniquely identify particular records in NCBI databases, which both study and experiment IDs do in the SRA database. Hence, I do not see a problem here.
  2. I have encountered publications that only report experiment IDs instead of the associated study and/or BioProject IDs. Given that best practices in reporting accession IDs are not enforced, I think adding support for these other IDs benefits users trying to retrieve all raw data used in a particular set of studies.
  3. Yes, I am planning to add support for these IDs too (new issue added: #117). As for the need to scrape these IDs: when one starts off by scraping a library, there is no guarantee that the associated run/study/BioProject ID is provided in the publication text (see also my answer in 2).

Let me know what you think and thanks for the inline comments.

@lina-kim thanks for spotting the "hyphen" issue. I actually added some special cases where IDs are not just fetched as-is (see here: https://github.com/bokulich-lab/q2-fondue/blob/0e1113661583e1f5cd948155526c7680a0538cdb/q2_fondue/scraper.py#L242-L256). But the scraper does not support hyphens yet. I will definitely add this in the future (see new issue #118). As for the metadata tabulate issue: Is this something you only encounter with q2-fondue outputs or with any other Q2 plugin? I personally do not see this error with q2-fondue outputs or any other ones.
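For context on the hyphen case, the following is a rough sketch of how a hyphenated range such as "SRR1000-SRR1003" could be expanded into individual IDs. It is an assumption about one possible approach, not what the scraper does or what #118 will implement; `RANGE` and `expand_ranges` are hypothetical names, and only plain ASCII hyphens are handled.

```python
import re
from typing import List

# Hypothetical pattern: a prefix such as SRR/ERX/DRS/SRP, a start number,
# a hyphen, an optional repeat of the same prefix, and an end number.
RANGE = re.compile(r"\b([SED]R[RXSP])(\d+)\s*-\s*(?:\1)?(\d+)\b")


def expand_ranges(text: str, max_span: int = 1000) -> List[str]:
    """Expand hyphenated accession ranges found in `text` into single IDs."""
    expanded = []
    for match in RANGE.finditer(text):
        prefix, start, end = match.group(1), match.group(2), match.group(3)
        lo, hi = int(start), int(end)
        # Skip ranges that look malformed or implausibly large.
        if len(start) != len(end) or hi < lo or hi - lo > max_span:
            continue
        width = len(start)  # preserve zero padding, e.g. SRX0098
        expanded.extend(f"{prefix}{n:0{width}d}" for n in range(lo, hi + 1))
    return expanded


if __name__ == "__main__":
    print(expand_ranges("Runs SRR1000-SRR1003 were deposited."))
```

The `max_span` cap is a design choice in this sketch to avoid generating thousands of IDs from a mistyped range.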

lina-kim commented 2 years ago

@lina-kim thanks for spotting the "hyphen" issue. I actually added some special cases where IDs are not just fetched as-is (see here: https://github.com/bokulich-lab/q2-fondue/blob/0e1113661583e1f5cd948155526c7680a0538cdb/q2_fondue/scraper.py#L242-L256). But the scraper does not support hyphens yet. I will definitely add this in the future (see new issue #118).

Thanks for addressing those cases @adamovanja!

As for the metadata tabulate issue: Is this something you only encounter with q2-fondue outputs or with any other Q2 plugin? I personally do not have this error occurring with q2-fondue outputs or any other ones.

Oh interesting. I don't have an issue with other QIIME outputs, but with fondue outputs I keep getting the following error when trying to run qiime metadata tabulate. It must be something on my end, I'll look into it.

There was an issue with loading the file metadata.qza as metadata:

  Metadata file must be encoded as UTF-8 or ASCII. The following error occurred when decoding the file:

  'utf-8' codec can't decode byte 0xb7 in position 17: invalid start byte

adamovanja commented 2 years ago

Oh interesting. I don't have an issue with other QIIME outputs, but with fondue outputs I keep getting the following error when trying to run qiime metadata tabulate. It must be something on my end, I'll look into it.

@lina-kim if the error persists, it would be great if you could open an issue in this repo listing the particular set of publications for which the scraped metadata file produces this error. I'm happy to look into it.
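When gathering details for such an issue, a small diagnostic like the one below can help narrow down the decode error reported above. It is only a sketch with hypothetical helper names (`first_invalid_utf8`, `looks_like_zip`) and is not part of QIIME 2 or q2-fondue. It reports the offset and value of the first byte that breaks UTF-8 decoding, and checks whether the data starts with the zip signature, since .qza artifacts are zip archives and a zip read as plain text would also fail to decode.

```python
from typing import Optional, Tuple

# Local file header signature found at the start of a typical zip archive.
ZIP_MAGIC = b"PK\x03\x04"


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (offset, byte value) of the first undecodable byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


def looks_like_zip(data: bytes) -> bool:
    """True if the data begins with the zip local file header signature."""
    return data.startswith(ZIP_MAGIC)


if __name__ == "__main__":
    # Stand-in bytes; in practice read them with open(path, "rb").read().
    sample = b"id\tname\nabc\xb7def\n"
    problem = first_invalid_utf8(sample)
    if problem is not None:
        offset, value = problem
        print(f"byte {value:#04x} at offset {offset}")
    print("zip archive:", looks_like_zip(sample))
```

If the file turns out to be a zip archive, the byte itself says little about the scraped text, and the more useful detail for the issue is how the file was passed to the command.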