bartaelterman closed 9 years ago
Hi @bartaelterman, I'm implementing based on your script but have a few more questions:
Thanks!
Hi @niconoe, indeed, some things are not correct in the script.
All right, thanks!
Does it make sense to store the text from the CSV in both the text and cleaned_text attributes? Or should I make text optional and keep it empty?
Hmm... Maybe best to save it in both text and cleaned_text.
Import script written, but I'm facing a small issue: we have a constraint that considers an article a duplicate if it has the same journal, publication date and title. Some source rows don't look like duplicates but still fail that check.
It happens for example with generic titles such as "REACTIES", when the publishing time is set at midnight... For example, with the command:
$ cat articles_with_score.tsv | grep "'REACTIES'"
You'll see different articles called "REACTIES", with a publication time of "Mon Jul 13 00 00 00 2009"
Should we remove our constraint? Or change it so it only fails if text or epu_score is identical too? That seems a bit weird (especially the text option), but maybe that's a pragmatic solution... What do you think?
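To get a feel for how many rows the current constraint rejects even though their text differs, something like the following could be run against the TSV. This is only a sketch: the column names (journal, published, title, text) and the sample rows are made up for illustration and are not the real header or data of articles_with_score.tsv.

```python
import csv
import io
from collections import defaultdict


def find_collisions(tsv_file, key_fields=("journal", "published", "title")):
    """Return constraint keys that are shared by rows with different text.

    NOTE: the field names are hypothetical; adjust them to the real TSV header.
    """
    texts_by_key = defaultdict(set)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        key = tuple(row[field] for field in key_fields)
        texts_by_key[key].add(row["text"])
    # Keep only the keys where at least two distinct texts collide.
    return {key: texts for key, texts in texts_by_key.items() if len(texts) > 1}


# Invented sample rows, only to show the shape of the problem.
sample = io.StringIO(
    "journal\tpublished\ttitle\ttext\n"
    "Journal A\tMon Jul 13 00 00 00 2009\tREACTIES\tfirst reader letter\n"
    "Journal A\tMon Jul 13 00 00 00 2009\tREACTIES\tsecond reader letter\n"
    "Journal A\tMon Jul 13 00 00 00 2009\tOther title\tsome article\n"
)

collisions = find_collisions(sample)
for key, texts in collisions.items():
    print(key, len(texts))
```

For the real file, pass an open file handle instead of the in-memory sample.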
I've been looking at the articles you're pointing to. Indeed, for some of them the constraint is too stringent. I would say, let's add the text to the constraint. However, for the articles imported by the script, the text will be saved in cleaned_text, so we should add that to the constraint too.
Hmmm, there's an additional issue there: text and cleaned_text cannot be added to the constraint, since the constraint implies an index, and the large text fields are too big to be indexed by Postgres...
I had a look at the initial issue (#54) about duplicates, but it seems we're a bit short of options here. I don't know how vital this constraint is... Is it there to prevent the scraper from accidentally adding the same article twice? In that case, if the scraper always returns a URL, maybe we can make that field unique while still allowing NULL values for the old articles.
Any other suggestion?
If setting the URL to unique does not imply that NULL values are not allowed, then that suggestion is ok!
Indeed, the SQL standard treats NULL as meaning "unknown", so duplicate NULLs are still allowed under a unique constraint.
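A quick way to check that behaviour without touching the real database is the sketch below. It uses Python's built-in sqlite3 as a stand-in for Postgres (an assumption made only to keep the example self-contained; both accept multiple NULLs in a unique column), and the article table with its title and url columns is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table: url is unique but nullable.
conn.execute(
    "CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, url TEXT UNIQUE)"
)

# Two old articles without a URL: both inserts succeed.
conn.execute("INSERT INTO article (title, url) VALUES (?, ?)", ("REACTIES", None))
conn.execute("INSERT INTO article (title, url) VALUES (?, ?)", ("REACTIES", None))

# A scraped article with a URL.
conn.execute(
    "INSERT INTO article (title, url) VALUES (?, ?)",
    ("Scraped article", "http://example.com/a"),
)

# Inserting the same URL again violates the unique constraint.
try:
    conn.execute(
        "INSERT INTO article (title, url) VALUES (?, ?)",
        ("Scraped again", "http://example.com/a"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

null_count = conn.execute(
    "SELECT COUNT(*) FROM article WHERE url IS NULL"
).fetchone()[0]
print(null_count, duplicate_rejected)
```

So the old articles can keep an empty URL, while the scraper cannot insert the same URL twice.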
Script finally works!
See #50 with link to tmp script.