Closed rahulbordoloi closed 4 years ago
Thanks @rahulbordoloi does this happen when you are converting to CSV?
Accidentally closed this on my phone, but I reopened it!
@edsu Yes, the JSONL format works fine, but when I try converting the same data to CSV the above error shows up. The CSV file that gets created is 0 KB, i.e. undefined.
Ok, I suspect that there is a corrupted line in the JSONL file. I will try this too.
Though the Chrome JSON parser works fine in my case. Do you want me to send you the tweet ID CSV from which I hydrated this JSONL and CSV? If yes, you can drop your email here.
Sure I am ehs@pobox.com
Sent. Please Check Once.
Hydrator appears to be writing a blank line to the JSONL file when none of the tweet ids can be hydrated. This is pretty rare unless the tweet ids have been corrupted in some way. Nevertheless, it should not write blank lines, because they could be problematic for downstream users of the JSONL who expect to find a complete JSON object on each line.
I noticed that there were blank lines in the SP.jsonl file you gave me @rahulbordoloi. I'm testing whether this causes problems for the CSV generation.
Will the blank lines in between cause a problem when I work with the JSONL files?
It depends on how you are processing them. Some JSON parsers may not care about being asked to parse a blank line. But it was the case that the Hydrator did care, it was throwing the error you reported when it was attempting to parse a blank line.
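For anyone processing the JSONL outside of Hydrator, here is a minimal Python sketch of a reader that tolerates blank lines. It is not Hydrator's own code; the `read_jsonl` helper and the sample `id_str` values are made up for illustration.

```python
import io
import json


def read_jsonl(lines):
    """Yield one parsed JSON object per non-blank line."""
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line is not a JSON document, so skip it
            # instead of handing it to json.loads (which would raise).
            continue
        yield json.loads(line)


# Hypothetical sample: two tweets with a blank line between them.
sample = io.StringIO('{"id_str": "1"}\n\n{"id_str": "2"}\n')
tweets = list(read_jsonl(sample))
print(len(tweets))
```

With a real file you would pass an open file handle instead of the `io.StringIO` sample, for example `read_jsonl(open("SP.jsonl", encoding="utf-8"))`.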
When you get a chance could you give v0.0.12 a try and see if your problem generating the CSV goes away?
Unfortunately I think your tweet id file has been corrupted. Do you see how they all end in 4 zeros? That is a good indicator that something processed the tweet ids that was unaware of overflow errors. I don't know if the file was that way when you downloaded it, or if Excel or some other tool mangled it. But it was useful here because the corrupted ids helped find a small bug in the Hydrator that wouldn't ordinarily get thrown.
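To illustrate the trailing-zeros symptom, here is a rough Python sketch. It assumes the mangling tool keeps only 15 significant digits and zeroes out the rest, which is how spreadsheets commonly treat long numbers; whether that is what happened to this particular file is not known. The id below and both helper names are made up.

```python
from decimal import Context, ROUND_DOWN

# Assumption: the tool keeps 15 significant digits and drops the rest.
FIFTEEN_DIGITS = Context(prec=15, rounding=ROUND_DOWN)


def truncate_to_15_digits(tweet_id: str) -> str:
    """Simulate what a 15-significant-digit tool does to a long id."""
    return str(int(FIFTEEN_DIGITS.create_decimal(tweet_id)))


def share_ending_in_zeros(ids, zeros=4):
    """Return the fraction of non-empty ids that end in `zeros` zeros."""
    cleaned = [i.strip() for i in ids if i.strip()]
    if not cleaned:
        return 0.0
    hits = sum(1 for i in cleaned if i.endswith("0" * zeros))
    return hits / len(cleaned)


original = "1212345678901234567"  # made-up 19-digit id, not a real tweet
mangled = truncate_to_15_digits(original)
print(mangled)
print(share_ending_in_zeros([mangled, original]))
```

In an intact id file only a tiny fraction of ids should end in four zeros by chance, so a share close to 1.0 is a strong hint that the ids were converted to numbers somewhere along the way.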
Regarding the Tweet IDs, I think the file might have been corrupted while downloading, or I might have corrupted it somehow by opening it in Excel. Can you suggest a better way to look through the Tweet IDs beforehand without changing their type and corrupting them?
And yes, I can now generate a CSV file from the generated JSONL. Thank you for the fix, and happy to contribute to such a handy and wonderful project. Great work. :)
I've Mailed you both the JSONL and CSV File. Please check if it's alright according to your desired output.
I recommend you use a text editor like VSCode, Emacs or Vim to inspect the ids. You may be able to open them in Excel but don't save them again, or else they will overflow and become useless. Thanks for reporting this issue. I'm closing for now since it seems like the latest version of Hydrator will not write blank newlines to the JSONL in these cases where large numbers of tweet ids cannot be hydrated.
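If a script is more convenient than a text editor, the same idea can be sketched in Python: read the ids as text and never convert them to numbers. The `preview_ids` helper and the sample ids are hypothetical.

```python
import io


def preview_ids(handle, limit=5):
    """Return the first `limit` non-empty ids, kept as strings."""
    ids = []
    for line in handle:
        value = line.strip()
        if value:
            ids.append(value)  # stays a str, so no digits are lost
        if len(ids) >= limit:
            break
    return ids


# Hypothetical sample standing in for an id file, one id per line.
sample = io.StringIO("1212345678901234567\n1212345678901234568\n")
first_ids = preview_ids(sample)
print(first_ids)
```

For a real file, pass an open handle such as `open("ids.txt", encoding="utf-8")`. If you load the ids with pandas instead, passing `dtype=str` to `read_csv` keeps them as text as well.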
OS : Windows 10
Problem : Each time I try hydrating a big dataset, I face this error. The error does not show up for smaller datasets. Can you please look into this issue? A screenshot of the error is attached.