DEV: Initial PR of scraper

ZuzanaSebb commented 9 months ago

The Python package contains 2 main command line commands:

get_metadata
pdf_analysis

Get metadata returns a csv file of fetched metadata to corresponding accession_ids, operates upon existing - modified submodule of q2_fondue plugin. Pdf_analysis returns a data-frame (csv file output) of the xml analysis of the corresponding pmc ids -publications.

ZuzanaSebb commented 8 months ago

Hi Lina! I did a few changes in the repo followed by a few bug fixes.

I followed you recommendations and changed the arguments names in __cli__.py
entrezpy_clients/__pycache__/ is removed
it would be helpful to discuss the data input formats before updating the code and the Readme file regarding the possible inputs. In the current version we support the inputs in a form: pdf_analysis --pmc_ids 9714783 7802287 -o res.txt For the long inputs users are expected to pass a shell variable .
I did just a few minor changes inside of the entrezpy_clients regarding the import of the used functions, as we discussed the plan is to make a separate module out of it.

alavrinienko commented 7 months ago

hey @ZuzanaSebb and @lina-kim, thanks for sharing and reviewing! Just a quick note re changing the repo name to mishmash - I fully agree with this @lina-kim - the ORD part was for our internal use only so no need to preserve.

lina-kim commented 6 months ago

Reviving this thread a bit late, thanks for your patience. Thanks for your changes and responses @ZuzanaSebb, and for your input Anton.

Are the Python requirements too stringent? I see it's <4.0,>=3.11, but Python 3.11 was released barely more than a year ago. Probably best practice to be on the latest version, but I am notoriously late (I was on 3.10 and my initial install failed) and others may be too. If this is not changed, we should note for the user that Python 3.11 is the minimum Python version.
I'm running into a setup problem when I try to pip install this package in a new environment. It works okay after I manually pip install parametrized=66.0.2 à la pyproject.ml, but we shouldn't expect this of a regular user. Is there a way we can streamline this or prevent this error from happening?

  ERROR: Failed building wheel for parametrized
  Running setup.py clean for parametrized
  Building wheel for docopt (setup.py) ... done
  Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl size=13706 sha256=dd56bb87cc485cdf7a5725145f147ae2f93a35600cef94dc27c0f9b35dfeec0e
  Stored in directory: /private/var/folders/my/lnw4vhzn7917kd_cb8pcwwfw0000gq/T/pip-ephem-wheel-cache-ynagtyk4/wheels/1a/b0/8c/4b75c4116c31f83c8f9f047231251e13cc74481cca4a78a9ce
Successfully built ord_mishmash bs4 lxml docopt
Failed to build parametrized

An issue I haven't figured out yet is the odd behavior of scrape pdf_analysis. It runs okay when I input a PMC ID with no SRA IDs to find (negative control, PMC6240460), but not when I input a PMC ID with a known SRA ID (PMC9921707). Have you run into this?

LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/Users/linkim/nltk_data'
    - '/Users/linkim/Documents/Work/Software/anaconda3/envs/test_mm/nltk_data'
    - '/Users/linkim/Documents/Work/Software/anaconda3/envs/test_mm/share/nltk_data'
    - '/Users/linkim/Documents/Work/Software/anaconda3/envs/test_mm/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************

ZuzanaSebb commented 6 months ago

Hi @lina-kim,

I changed the python requirements to be >=3.9.
I cleaned the dependencies so you shouldn't get the ERROR: Failed building wheel for parametrized.
The problem with the PMC id PMC6240460 is actually an expected behaviour. When you check the api: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC6240460 it doesn't return the expected paper XML just a message: The publisher of this article does not allow downloading of the full text in XML form. You should be able to see this message on the command line as well. The csv file is still returned, while you can still have some valid PMC ids in your id collection for which you will get the information. All the invalid ones will come enumerated in the message before the output file.

lina-kim commented 6 months ago

I changed the python requirements to be >=3.9.

I cleaned the dependencies so you shouldn't get the ERROR: Failed building wheel for parametrized.

These work beautifully now, thank you!

The problem with the PMC id PMC6240460 is actually an expected behaviour. When you check the api: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC6240460 it doesn't return the expected paper XML just a message: The publisher of this article does not allow downloading of the full text in XML form. You should be able to see this message on the command line as well. The csv file is still returned, while you can still have some valid PMC ids in your id collection for which you will get the information. All the invalid ones will come enumerated in the message before the output file.

Oh no, I meant my issues went the other way around. PMC6240460 worked okay and gave me the error message I wanted:

$ scrape pdf_analysis --pubmed_central_ids PMC6240460 --output_file test1.tsv
Papers represented by followin PMC ids was not fetched, the publisher of this article does not allow downloading of the full text in XML form:
PMC6240460
Result successfully obtained!
Result saved to test1.tsv

It was the other PMC ID which gave me the odd NLTK error, even in a clean environment:

$ scrape pdf_analysis --pubmed_central_ids PMC9921707 --output_file test2.tsv
Traceback (most recent call last):
  File "/Users/linkim/.venv/bin/scrape", line 8, in <module>
    sys.exit(main())
  File "/Users/linkim/.venv/lib/python3.9/site-packages/ord_mishmash/cli.py", line 34, in main
    df_analysis = pdf_analysis(args.pmc_ids)
  File "/Users/linkim/.venv/lib/python3.9/site-packages/ord_mishmash/scrape_pdf.py", line 287, in pdf_analysis
    'method': [el.parse_method()]
  File "/Users/linkim/.venv/lib/python3.9/site-packages/ord_mishmash/scrape_pdf.py", line 232, in parse_method
    for sentence in sent_tokenize(self.get_text())]
  File "/Users/linkim/.venv/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
    tokenizer = load(f"tokenizers/punkt/{language}.pickle")
  File "/Users/linkim/.venv/lib/python3.9/site-packages/nltk/data.py", line 750, in load
    opened_resource = _open(resource_url)
  File "/Users/linkim/.venv/lib/python3.9/site-packages/nltk/data.py", line 876, in _open
    return find(path_, path + [""]).open()
  File "/Users/linkim/.venv/lib/python3.9/site-packages/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/Users/linkim/nltk_data'
    - '/Users/linkim/.venv/nltk_data'
    - '/Users/linkim/.venv/share/nltk_data'
    - '/Users/linkim/.venv/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************

At first I wondered if this article was restricted, but it does appear to be part of the PMC Open Access Subset and even has a CC BY 4.0 license. What could be the issue?

ZuzanaSebb commented 6 months ago

PMC9921707

Is it still happening with this PMC9921707 pmc_id? I can't reproduce the error, and I am able to obtain the output. I hoped that the requirements cleanup would fix it.

lina-kim commented 6 months ago

Is it still happening with this PMC9921707 pmc_id? I can't reproduce the error, and I am able to obtain the output. I hoped that the requirements cleanup would fix it.

Sadly yes, I still get this error with a clean environment. Digging in, it looks like punkt is a pre-trained model associated with the nltk package; but this is not downloaded with nltk installation. I am able to run the code successfully after manually downloading punkt with python -m nltk.downloader punkt. But ideally, this would be an automated download during the install.

If you remove (or move) $HOME/nltk_data/, create a clean environment, and then try running scrape pdf_analysis -- do you still get a successful run?

lina-kim commented 3 months ago

The output of database_name (just {'N'}) is currently uninformative for input accession number "PRJNA607574". This is a result of the function get_accession_tuples() returning the output of (functionally) re.findall(r'(PRJ(E|D|N)[A-Z][0-9]+)', "PRJNA607574"). There should be an intermediate step in which the E, D, and N are associated with the name of the database (ENA, DDBJ, or SRA) to output to the user.

ZuzanaSebb commented 3 months ago

Done.

bokulich-lab / mishmash

DEV: Initial PR of scraper #1