DOAJ / doaj

The Directory of Open Access Journals - website and directory software
Apache License 2.0

CSV file incorrect #99

Closed dommitchell closed 10 years ago

dommitchell commented 10 years ago

From a user: /snip/ Our import job detected that the http://www.doaj.org/csv data feed today only contained 3911 entries while there had been over 8000 in the same list previously. Your web page says that the database contains 9804 journals. Some journals that can be found through search function on your site are missing from the CSV file (e.g. Acta Palaeobotanica). /snip/ Today I only get 1965 rows.
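
For anyone consuming the feed, the kind of check the user's import job appears to have run can be sketched roughly as below. This is only an illustration: the function names, the 20% threshold, and the assumption that the CSV has a single header row are all hypothetical and are not taken from the user's job or from the DOAJ code.

```python
import csv
import io


def count_rows(csv_text):
    """Count data rows in CSV text, assuming (hypothetically) one header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the assumed header row
    return sum(1 for _ in reader)


def looks_truncated(current_rows, previous_rows, max_drop=0.2):
    """Flag a feed whose row count fell by more than max_drop since the last run.

    max_drop is an arbitrary illustrative threshold, not a DOAJ setting.
    """
    if previous_rows == 0:
        return False
    return current_rows < previous_rows * (1 - max_drop)
```

With the numbers reported above, looks_truncated(3911, 8000) would flag the feed, since 3911 is well below 80% of 8000.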

emanuil-tolev commented 10 years ago

confirmed

richard-jones commented 10 years ago

Emanuil - could this be related to the ES problem we saw earlier? If one or more of the shards went down, the csv wouldn't have access to the complete dataset, so would show a reduced number?

Richard Jones,

Founder, Cottage Labs t: @richard_d_jones, @cottagelabs w: http://cottagelabs.com

emanuil-tolev commented 10 years ago

No, that was me opening an SSH tunnel to the production server so I could get direct, secure access to ES and send queries to it (I thought I needed to run against production data to get the exact numbers for Dom's question about which journals had an alternative title).

I got my access, but I presume localhost:9200 then stopped responding to the server's own localhost :) and was only being forwarded to me.

emanuil-tolev commented 10 years ago

Issue fixed on live site now, fix in 8a93b9171f69e8a7ad78960f779afe5e45af5530 .

dommitchell commented 10 years ago

Might be unrelated. User feedback: "Unfortunately there is still no data in the ‘publication fee’ column. Could you send me a file that includes this data?"

Can you remind me if these fields should now be included or is it an old data-->new data patching issue?

richard-jones commented 10 years ago

There is just no data in this field yet, as this is part of the new data model.

With the new data model in place, we probably need to re-visit how the search results, csv and oai-pmh DC are presented, to ensure best coverage.

dommitchell commented 10 years ago

So even if a journal has been accepted and provided that information, the data doesn't go into the fields in the csv, right? That requires additional work?

richard-jones commented 10 years ago

Yup, that's right. We didn't include any work on the user-facing side of the site when we implemented the new application form.

dommitchell commented 10 years ago

How about journals accepted not making it into the file at all?

/snip/ when i download http://doaj.org/csv the result is a file named doaj_20140507_1330_utf8.csv (attached as zip-file). This file is missing new entries starting with 5/8/2014. Two examples: Earth Surface Dynamics, added 2014-05-08 Earth Surface Dynamics Discussions, added 2014-06-06 /snip/

emanuil-tolev commented 10 years ago

Hm, looks like it's not regenerating properly. Gonna check it.

emanuil-tolev commented 10 years ago

Yep, data problem or some such causing csv to fail to regenerate, need to investigate a bit more.

cloo@yonce:~/cron-logs$ tail -20 doaj-journal-csv_2014-06-12_1500.log
Running in production
Loaded final config from /opt/doaj/src/doaj/production.cfg
Traceback (most recent call last):
  File "/opt/doaj/src/doaj/portality/scripts/journalcsv.py", line 41, in <module>
    thecsv += get_csv_string(j.csv())
  File "/opt/doaj/src/doaj/portality/models.py", line 842, in csv
    row.append( multival_sep.join(bibjson.language))
TypeError

emanuil-tolev commented 10 years ago

CSV regenerating OK again. "Earth Surface Dynamics" in there, csv says it was added 6th June 2014.

When a journal did not have "language" defined at all, the csv generation broke. I've made it assume (correctly) that if "language" is not defined at all, then it's just an empty list. Fix in 3706a0285f6beb694275aace1e9dc70ea766f342 .
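
A minimal sketch of the kind of guard described above, for readers who hit the same TypeError. It is not the code from the commit: bibjson is modelled here as a plain dict rather than the model object in portality/models.py, and the helper name language_cell and the separator value are assumptions made for illustration.

```python
# Hypothetical stand-in; the real multival_sep value is not shown in this thread.
MULTIVAL_SEP = ", "


def language_cell(bibjson, sep=MULTIVAL_SEP):
    """Build the CSV cell for a journal's languages, tolerating a missing field.

    bibjson is a plain dict here purely for illustration.
    """
    languages = bibjson.get("language")
    if languages is None:
        # "language" is not defined at all: treat it as an empty list,
        # so that join() receives an iterable instead of None.
        languages = []
    return sep.join(languages)
```

With this guard, a journal with no "language" key yields an empty cell instead of raising, while a journal with languages is joined as before.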