PathwayCommons / classifier-pipeline

A workflow for classifying articles
MIT License

Fatal error in pipeline: Missing element #32

Closed by jvwong 1 year ago

jvwong commented 1 year ago

Trying to process articles from the daily updates with scripts/cron/cron.py. The last file added was pubmed23n1223.xml.gz.

2023-02-22 13:24:39.000 | INFO     | classifier_pipeline.pubmed:prediction_print_spy:307 - Identified 99 hits from 26534 tested (0.373%);mean probability: 0.994 --- pmid: 34561789; prob=0.996
2023-02-22 13:24:39.000 | INFO     | classifier_pipeline.pubmed:prediction_print_spy:307 - Identified 100 hits from 26784 tested (0.373%);mean probability: 0.994 --- pmid: 34989308; prob=0.996
2023-02-22 13:24:39.000 | INFO     | classifier_pipeline.pubmed:_pmc_supplement_transfomer:277 - Retrieving 34 PMC IDs
Traceback (most recent call last):
  File "cron.py", line 56, in <module>
    pipeline = as_pipeline(
  File "/home/baderlab/Documents/dev/classifier-pipeline/classifier_pipeline/utils.py", line 15, in as_pipeline
    generator = step(generator)
  File "/home/baderlab/Documents/dev/classifier-pipeline/classifier_pipeline/utils.py", line 126, in exhaust
    deque(generator, maxlen=0)
  File "/home/baderlab/Documents/dev/classifier-pipeline/classifier_pipeline/utils.py", line 118, in _db_loader
    for item in items:
  File "/home/baderlab/Documents/dev/classifier-pipeline/classifier_pipeline/pubmed.py", line 279, in _pmc_supplement_transfomer
    for pubmed_chunk in pubmed_chunks:
  File "/home/baderlab/miniconda3/envs/pipeline/lib/python3.8/site-packages/ncbiutils/ncbiutils.py", line 158, in get_citations
    citations = self._parse_response(response.content)
  File "/home/baderlab/miniconda3/envs/pipeline/lib/python3.8/site-packages/ncbiutils/ncbiutils.py", line 143, in _parse_response
    return self._parse_xml(data)
  File "/home/baderlab/miniconda3/envs/pipeline/lib/python3.8/site-packages/ncbiutils/ncbiutils.py", line 138, in _parse_xml
    return list(records)
  File "/home/baderlab/miniconda3/envs/pipeline/lib/python3.8/site-packages/ncbiutils/pmcxmlparser.py", line 162, in parse
    journal = self._get_journal(pmc_article)
  File "/home/baderlab/miniconda3/envs/pipeline/lib/python3.8/site-packages/ncbiutils/pmcxmlparser.py", line 118, in _get_journal
    iso_abbreviation = self._get_iso_abbreviation(pmc_article)
  File "/home/baderlab/miniconda3/envs/pipeline/lib/python3.8/site-packages/ncbiutils/pmcxmlparser.py", line 64, in _get_iso_abbreviation
    text = _collect_element_text(isoabbrev)
  File "/home/baderlab/miniconda3/envs/pipeline/lib/python3.8/site-packages/ncbiutils/xml.py", line 50, in _collect_element_text
    return ' '.join(element.xpath('string()').split())
AttributeError: 'NoneType' object has no attribute 'xpath'

jvwong commented 1 year ago

Traced it to
https://github.com/PathwayCommons/ncbiutils/blob/556c4b14e03b42b613f65972bf6b4afe2a121408/ncbiutils/pmcxmlparser.py#L62-L66

Must be some record with a missing or mangled iso-abbrev element, so the lookup returns None and _collect_element_text fails on it?
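A minimal sketch of the kind of guard that would avoid the AttributeError, assuming the cause is a record with no iso-abbrev journal ID. This is not the ncbiutils code: it uses the standard library's xml.etree.ElementTree instead of lxml, and the function names, the XPath expression, and the sample XML are all hypothetical stand-ins for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def collect_element_text(element: Optional[ET.Element]) -> Optional[str]:
    """Return whitespace-normalized text, or None when the element is absent."""
    if element is None:
        # A missing element is reported as None instead of raising.
        return None
    return ' '.join(''.join(element.itertext()).split())


def get_iso_abbreviation(article: ET.Element) -> Optional[str]:
    # Assumed location of the abbreviation; find() returns None if nothing matches.
    node = article.find(".//journal-id[@journal-id-type='iso-abbrev']")
    return collect_element_text(node)


# Hypothetical records: one with an iso-abbrev journal ID, one without.
WITH_ABBREV = ET.fromstring(
    "<article><front><journal-meta>"
    "<journal-id journal-id-type='iso-abbrev'>Nat  Commun</journal-id>"
    "</journal-meta></front></article>"
)
WITHOUT_ABBREV = ET.fromstring(
    "<article><front><journal-meta>"
    "<journal-id journal-id-type='nlm-ta'>Nat Commun</journal-id>"
    "</journal-meta></front></article>"
)

if __name__ == "__main__":
    print(get_iso_abbreviation(WITH_ABBREV))
    print(get_iso_abbreviation(WITHOUT_ABBREV))
```

Returning None keeps one incomplete record from aborting the whole run; whether the journal field should be optional downstream is a separate decision.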

jvwong commented 1 year ago

Superseded by https://github.com/PathwayCommons/ncbiutils/issues/60