ctds-usyd / scopus

Extracts Elsevier Scopus Snapshot to relational database
GNU General Public License v3.0

Loading error on ubuntu + mysql #25

Open davidpitl opened 6 years ago

davidpitl commented 6 years ago

This is just an error report, but I don't know where else to send it. The directory /file_150944456025_2017-10-30_ANI-CITEDBY/ contains small zip files (e.g. 2017-10-30_ANICITEDBY_011-1.zip).

./extract_to_db.sh //file_1509358456025_2017-10-30_ANI-CITEDBY/
Traceback (most recent call last):
  File "Scopus/db_loader.py", line 424, in <module>
    main()
  File "Scopus/db_loader.py", line 416, in main
    extract_and_load_docs(args.paths, pool=pool)
  File "Scopus/db_loader.py", line 361, in extract_and_load_docs
    for counter, doc_record in enumerate(imap(_process_one, xml_pairs)):
  File "Scopus/db_loader.py", line 266, in generate_xml_pairs
    if eid_filter is not None and eid_filter(int(os.path.dirname(path).rsplit('-')[-1])):
ValueError: invalid literal for int() with base 10: ''

jnothman commented 6 years ago

It looks like your input data uses a different naming scheme from the one I'm familiar with...

davidpitl commented 6 years ago

Example of zip filenames in my data directory:

2017-10-30_ANICITEDBY_00-1.zip
2017-10-30_ANICITEDBY_00-2.zip
2017-10-30_ANICITEDBY_00-3.zip
2017-10-30_ANICITEDBY_00-4.zip
2017-10-30_ANICITEDBY_00-5.zip
2017-10-30_ANICITEDBY_00-6.zip
...

Example of XML filenames contained in the zip files:

2-s2.0-85031432039-citedby.xml
2-s2.0-85031432183-citedby.xml
2-s2.0-85031432677-citedby.xml
2-s2.0-85031432760-citedby.xml
...
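
A quick way to check which naming layout an archive uses is to list its members with Python's standard zipfile module. This is only a minimal sketch: `layout_counts` is a hypothetical helper introduced here for illustration and is not part of the repository. It counts members that sit at the top level of the archive ("flat") against members inside a directory ("nested").

```python
import zipfile
from collections import Counter


def layout_counts(zip_source):
    """Count flat vs. nested member names in a zip archive.

    zip_source may be a filesystem path or a binary file-like object,
    since zipfile.ZipFile accepts both.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith('/'):
                continue  # skip explicit directory entries
            # Zip member names always use '/' as the separator.
            counts['nested' if '/' in name else 'flat'] += 1
    return counts
```

Calling `layout_counts('2017-10-30_ANICITEDBY_00-1.zip')` on one of the archives above would show whether the XML files sit directly in the archive or inside per-document directories.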

jnothman commented 6 years ago

Well, the script is currently set up to deal with data that includes both citedby and the abstract XML, and probably won't work without that.

We have a structure like 2-s2.0-85031432760/citedby.xml. I've committed something that might help a little...
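
For anyone hitting the same ValueError: the failing expression in the traceback takes the EID from the directory part of the member path, so a flat name such as 2-s2.0-85031432039-citedby.xml leaves it with an empty string to convert. Below is a minimal sketch, not the repository's code and not the committed fix, of a parser that accepts both layouts. The names `original_expression` and `eid_from_member` are hypothetical and introduced here only for illustration.

```python
import os
import posixpath


def original_expression(path):
    # The int(...) part of line 266 in the traceback, without eid_filter.
    return int(os.path.dirname(path).rsplit('-')[-1])


def eid_from_member(path):
    """Return the numeric EID for an archive member path, or None.

    Assumed layouts (both taken from this thread):
      2-s2.0-85031432760/citedby.xml   EID in the directory name
      2-s2.0-85031432760-citedby.xml   EID in the file name
    """
    dirname = posixpath.dirname(path)
    if dirname:
        # Nested layout: the EID is the last '-' separated field.
        candidate = dirname.rsplit('-', 1)[-1]
    else:
        # Flat layout: drop the extension, then take the field
        # before the trailing "-citedby" style suffix.
        stem = posixpath.splitext(posixpath.basename(path))[0]
        parts = stem.rsplit('-', 2)
        candidate = parts[-2] if len(parts) == 3 else ''
    try:
        return int(candidate)
    except ValueError:
        return None
```

With the nested name, `original_expression` works as before. With the flat name, `os.path.dirname` returns an empty string, which is what triggers the reported error, whereas `eid_from_member` falls back to the file name and returns None when no numeric field is found.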

davidpitl commented 6 years ago

Are we talking about the same Scopus Custom Data XSD? I've attached the XSD manual.

David

jnothman commented 6 years ago

Very possibly not. We get the cited_by files, but other files too.

davidpitl commented 6 years ago

I agree with you. Now I have files of this type: 2017-10-30_ANI_04-xml-5.zip ... and of this other type: 2017-10-30_ANICITEDBY_00-1.zip

but I still get a new error:

Another question: how can I include new schema attributes, such as affiliation?

Thanks in advance, David

jnothman commented 6 years ago

Affiliation is already in the schema. I've fixed that bug, sorry.

davidpitl commented 6 years ago

Now I get the attached error logs. My Scopus custom data has two types of files: 2017-10-30_ANI_00-xml-1.zip and 2017-10-30_ANICITEDBY_011-1.zip. Each log file corresponds to a run on one type of file.

I can send you example XML files contained in them.

My direct email is: david.perez@inv.uam.es

David

