Closed lanphan closed 8 years ago
You can add a .sh script under input/ as we do for input/articles.tsv.sh. If you grep '\\r' *.jsonl, you can see many articles contain \r. To filter these, we actually had a gsub in input/articles.tsv.sh of the spouse example that takes care of this. Maybe we should add a comment to make this more apparent.

The intended way to run the whole signalmedia-1m corpus is to put the directory under input/ so the file sits at input/signalmedia/signalmedia-1m.jsonl. That way, input/articles.tsv.sh will pick up the .jsonl file and apply the necessary filters. You may want to remove the grep commands so as not to drop any articles. Please reopen if you find more issues.
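For reference, a quick way to check whether a corpus still carries the \r escapes discussed above is to grep for the literal backslash-r sequence, since JSON encodes a carriage return as those two characters. This is a minimal sketch using a hypothetical sample file, not part of the example's scripts:

```shell
# Build a tiny hypothetical .jsonl sample; the JSON escape \r stands for a
# carriage return inside the article content.
printf '%s\n' '{"id":"a1","content":"line1\rline2"}' '{"id":"a2","content":"clean"}' > /tmp/sample.jsonl

# Count lines containing the literal two-character sequence backslash-r.
grep -c '\\r' /tmp/sample.jsonl
```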
@netj I only commented out "head -100 |" in order to get all relevant data.
I think the current gsub is not enough; see my attached file below for the result of grep '\\r' input/articles-1m.tsv, where articles-1m.tsv is the output of input/articles.tsv.sh.
@netj I renamed "still_contains_r.txt" to "articles.tsv" and ran "deepdive do sentences", and I got the same error. Would you please reopen this bug? (I don't see an Open or Reopen button here.)
PS: thanks to your feedback, I now understand that input/articles.tsv.sh is also a pre-processing step for DeepDive (it runs inside DeepDive, not as a separate step).
Yes, I confirmed there seem to be issues with the existing \r handling within jq. I'll fix this ASAP and update here. Meanwhile, you could just add a good old sed line, which is probably going to be safe and more complete than before, keeping carriage-return-phobic PostgreSQL happy:
cat "$corpus" |
#grep -E 'wife|husband|married' |
#head -100 |
jq -r '[.id, .content] | @tsv' |
# take care of carriage returns
sed 's/\\r//g'
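A minimal illustration of why the sed pattern uses a doubled backslash: jq's @tsv encodes a carriage return in the content as the literal two-character sequence backslash-r, so the sed line deletes exactly that sequence. The TSV line below is a hypothetical sample:

```shell
# The printf output contains a literal backslash followed by "r" (the form
# jq's @tsv emits for a carriage return); sed's \\r pattern matches and
# removes that two-character sequence.
printf 'id1\tline1\\rline2\n' | sed 's/\\r//g'
```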
Hi all,
I got the error below when doing "deepdive do sentences" in the quickstart example (the "has spouse" example) with the full dataset from signalmedia (1 million records):
Content of data from document 1060ad64-521f-46c7-a804-4181d97f9bf0 is:
Googling around, I see that it's an error in the COPY command of PostgreSQL. I have some questions:
1/ Is there any way to pre-process the data to prevent this bug from happening again?
2/ Are there any known bugs like this, so that we can collect them and create a specific pre-processing step to prevent all of them at once?
3/ Does DeepDive have any mechanism to log these errors and skip them in order to continue running?
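On question 1/, one possible pre-processing sketch (an assumption on my part, not a built-in DeepDive mechanism) is to delete raw carriage-return bytes from the TSV stream before PostgreSQL's COPY ever sees them:

```shell
# Hedged sketch: tr -d '\r' strips raw CR bytes (0x0d) from a stream.
# The sample TSV line is hypothetical; in practice this filter would sit
# in the pipeline that feeds COPY.
printf 'id1\tbad\rvalue\n' | tr -d '\r'
```

Note this handles raw CR bytes, which is a different layer from the sed line above that removes jq's two-character \r escape; depending on where the corruption enters, one or both may be needed.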