Closed ZuzanaSebb closed 3 months ago
Hi Lina! I made a few changes in the repo, followed by a few bug fixes.
I followed your recommendations and changed the argument names in __cli__.py.
entrezpy_clients/__pycache__/ is removed.
It would be helpful to discuss the data input formats before updating the code and the README file regarding the possible inputs. In the current version we support inputs in the form: pdf_analysis --pmc_ids 9714783 7802287 -o res.txt
For long inputs, users are expected to pass a shell variable.
I made just a few minor changes inside entrezpy_clients regarding the imports of the used functions; as we discussed, the plan is to make a separate module out of it.
Hey @ZuzanaSebb and @lina-kim, thanks for sharing and reviewing! Just a quick note re changing the repo name to mishmash: I fully agree with this, @lina-kim - the ORD part was for our internal use only, so there's no need to preserve it.
Reviving this thread a bit late, thanks for your patience. Thanks for your changes and responses @ZuzanaSebb, and for your input Anton.
Installing the package currently fails for me with a build error for parametrized. I can get around it by pinning the version manually with pip install parametrized==66.0.2, à la pyproject.toml, but we shouldn't expect this of a regular user. Is there a way we can streamline this or prevent this error from happening?
ERROR: Failed building wheel for parametrized
Running setup.py clean for parametrized
Building wheel for docopt (setup.py) ... done
Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl size=13706 sha256=dd56bb87cc485cdf7a5725145f147ae2f93a35600cef94dc27c0f9b35dfeec0e
Stored in directory: /private/var/folders/my/lnw4vhzn7917kd_cb8pcwwfw0000gq/T/pip-ephem-wheel-cache-ynagtyk4/wheels/1a/b0/8c/4b75c4116c31f83c8f9f047231251e13cc74481cca4a78a9ce
Successfully built ord_mishmash bs4 lxml docopt
Failed to build parametrized
I also tried running scrape pdf_analysis. It runs okay when I input a PMC ID with no SRA IDs to find (negative control, PMC6240460), but not when I input a PMC ID with a known SRA ID (PMC9921707). Have you run into this?
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/Users/linkim/nltk_data'
- '/Users/linkim/Documents/Work/Software/anaconda3/envs/test_mm/nltk_data'
- '/Users/linkim/Documents/Work/Software/anaconda3/envs/test_mm/share/nltk_data'
- '/Users/linkim/Documents/Work/Software/anaconda3/envs/test_mm/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
Hi @lina-kim,
- I changed the Python requirements to be >=3.9.
- I cleaned the dependencies, so you shouldn't get ERROR: Failed building wheel for parametrized anymore.
- The problem with the PMC id PMC6240460 is actually expected behaviour. When you check the API (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC6240460), it doesn't return the expected paper XML, just the message: "The publisher of this article does not allow downloading of the full text in XML form." You should be able to see this message on the command line as well. The csv file is still returned: for the valid PMC ids in your collection you will still get their information, and the invalid ones are enumerated in a message before the output file is written.
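For reference, the restriction check described above can be probed directly. The sketch below is not the package's actual code; the function names are hypothetical, but the efetch endpoint and the publisher-restriction message are the ones quoted in this thread.

```python
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(pmc_id):
    # Build the same efetch query as the link above.
    return f"{EFETCH}?db=pmc&id={pmc_id}"

def pmc_xml_available(pmc_id):
    # True when the response looks like article XML rather than the
    # publisher's "does not allow downloading" notice quoted above.
    body = urlopen(efetch_url(pmc_id)).read().decode("utf-8", errors="replace")
    return "does not allow downloading" not in body
```

Running `pmc_xml_available("PMC6240460")` should return False for the restricted article, assuming the publisher notice text is stable.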
> - I changed the Python requirements to be >=3.9.
> - I cleaned the dependencies, so you shouldn't get ERROR: Failed building wheel for parametrized anymore.

These work beautifully now, thank you!

> - The problem with the PMC id PMC6240460 is actually expected behaviour. When you check the API (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC6240460), it doesn't return the expected paper XML, just the message: "The publisher of this article does not allow downloading of the full text in XML form." You should be able to see this message on the command line as well. The csv file is still returned: for the valid PMC ids in your collection you will still get their information, and the invalid ones are enumerated in a message before the output file is written.
Oh no, I meant my issues went the other way around. PMC6240460 worked okay and gave me the error message I wanted:
$ scrape pdf_analysis --pubmed_central_ids PMC6240460 --output_file test1.tsv
Papers represented by followin PMC ids was not fetched, the publisher of this article does not allow downloading of the full text in XML form:
PMC6240460
Result successfully obtained!
Result saved to test1.tsv
It was the other PMC ID which gave me the odd NLTK error, even in a clean environment:
$ scrape pdf_analysis --pubmed_central_ids PMC9921707 --output_file test2.tsv
Traceback (most recent call last):
File "/Users/linkim/.venv/bin/scrape", line 8, in <module>
sys.exit(main())
File "/Users/linkim/.venv/lib/python3.9/site-packages/ord_mishmash/cli.py", line 34, in main
df_analysis = pdf_analysis(args.pmc_ids)
File "/Users/linkim/.venv/lib/python3.9/site-packages/ord_mishmash/scrape_pdf.py", line 287, in pdf_analysis
'method': [el.parse_method()]
File "/Users/linkim/.venv/lib/python3.9/site-packages/ord_mishmash/scrape_pdf.py", line 232, in parse_method
for sentence in sent_tokenize(self.get_text())]
File "/Users/linkim/.venv/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
tokenizer = load(f"tokenizers/punkt/{language}.pickle")
File "/Users/linkim/.venv/lib/python3.9/site-packages/nltk/data.py", line 750, in load
opened_resource = _open(resource_url)
File "/Users/linkim/.venv/lib/python3.9/site-packages/nltk/data.py", line 876, in _open
return find(path_, path + [""]).open()
File "/Users/linkim/.venv/lib/python3.9/site-packages/nltk/data.py", line 583, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/Users/linkim/nltk_data'
- '/Users/linkim/.venv/nltk_data'
- '/Users/linkim/.venv/share/nltk_data'
- '/Users/linkim/.venv/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
At first I wondered if this article was restricted, but it does appear to be part of the PMC Open Access Subset and even has a CC BY 4.0 license. What could be the issue?
Is it still happening with this PMC9921707 pmc_id? I can't reproduce the error, and I am able to obtain the output. I hoped that the requirements cleanup would fix it.
> Is it still happening with this PMC9921707 pmc_id? I can't reproduce the error, and I am able to obtain the output. I hoped that the requirements cleanup would fix it.
Sadly yes, I still get this error in a clean environment. Digging in, it looks like punkt is a pre-trained model associated with the nltk package, but it is not downloaded when nltk itself is installed. I am able to run the code successfully after manually downloading punkt with python -m nltk.downloader punkt. Ideally, though, this download would be automated during the install.
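A minimal sketch of how the package could automate this, assuming it is acceptable to fetch the model lazily at runtime rather than at install time (the function name is hypothetical; nltk.data.find and nltk.download are real NLTK calls):

```python
def ensure_punkt():
    """Download the NLTK "punkt" tokenizer model on first use if missing."""
    import nltk  # imported lazily so the check only runs when needed

    try:
        # Raises LookupError when the model is absent, as in the
        # traceback above.
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        nltk.download("punkt", quiet=True)
```

Calling `ensure_punkt()` before the first `sent_tokenize` call would avoid the LookupError without any manual step by the user.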
If you remove (or move) $HOME/nltk_data/, create a clean environment, and then try running scrape pdf_analysis -- do you still get a successful run?
The output of database_name (just {'N'}) is currently uninformative for the input accession number "PRJNA607574". This is a result of the function get_accession_tuples() returning the output of (functionally) re.findall(r'(PRJ(E|D|N)[A-Z][0-9]+)', "PRJNA607574"). There should be an intermediate step in which the letters E, D, and N are associated with the name of the database (ENA, DDBJ, or SRA) before being output to the user.
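The missing step could look something like the sketch below; the helper name and mapping dict are hypothetical, while the regex is the one quoted above.

```python
import re

# Map the accession prefix letter to its archive name, so the user sees
# e.g. "SRA" instead of the bare letter "N".
ARCHIVES = {"E": "ENA", "D": "DDBJ", "N": "SRA"}

def get_accession_databases(text):
    """Return (accession, database_name) pairs found in free text."""
    return [(acc, ARCHIVES[letter])
            for acc, letter in re.findall(r"(PRJ(E|D|N)[A-Z][0-9]+)", text)]
```

For the example above, `get_accession_databases("PRJNA607574")` yields `[("PRJNA607574", "SRA")]`, which is directly usable in the user-facing output.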
Done.
The Python package provides two main command-line commands: get_metadata, which returns a csv file of metadata fetched for the given accession_ids and operates on an existing, modified submodule of the q2_fondue plugin; and pdf_analysis, which returns a data frame (csv file output) of the XML analysis of the publications corresponding to the given PMC ids.