PlanTL-GOB-ES / SciELO-Spain-Crawler

[PlanTL/medicine/dataset generation/retrieval] Crawler to download all the publications written in Spanish from the Spanish SciELO server.
MIT License

Crawler crashes #1

Open avacaondata opened 1 year ago

avacaondata commented 1 year ago

There seems to be an issue with the XML structure. The error message is as follows:

Getting Scielo journals.
Parsing Scielo journals XML.
[Fatal Error] scielo-sets.xml:1:42: The markup in the document following the root element must be well-formed.
Error parsing journals.
0 journals found.
Saving Scielo journals info.
Number of Scielo documents downloaded: 0
Date saved: 2023-02-22
Extracting new publication XMLs from Scielo.
Finished downloading all XMLs from new publications.
Number of Scielo full text documents downloaded: 0
Number of Scielo Dublin Cores extended: 0
Failed records: 0
Number of raw text and XMLs extracted from publications: 0
Failed records: 0
Finished creating TSV file from the complete Scielo corpus.
Couldn't create TSV lines for 0 records.

I'd appreciate some help, thank you very much :) @anderintxa

mariaega11 commented 11 months ago

Hello,

I have the same error. Is there any update on this? Thanks
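
For anyone hitting the same problem: the fatal error reported for scielo-sets.xml (line 1, column 42) says that the markup following the root element must be well-formed, so the parser found something it could not read right after the first element closed. One possible cause, which is only an assumption here, is that the server returned something other than the expected XML listing. A minimal Python sketch for checking whether a downloaded file is well-formed and where parsing stops is shown below. The helper name `check_well_formed` and the sample strings are hypothetical and are not part of the crawler.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses as XML,
    otherwise (False, (line, column, message))."""
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple
        line, column = err.position
        return False, (line, column, str(err))


# Hypothetical samples, not real SciELO responses.
good_sample = "<sets><set>example</set></sets>"
bad_sample = "<a></a><b></b>"  # second element after the root has closed

print(check_well_formed(good_sample))
print(check_well_formed(bad_sample))
```

To inspect the real file, read scielo-sets.xml into a string (for example with `open("scielo-sets.xml", encoding="utf-8").read()`) and pass it to the helper. Printing the first few hundred characters of the file may also show whether it contains the expected XML at all.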