Open JunjieCheng opened 6 years ago
@JunjieCheng the files are in a binary format that TensorFlow accepts for testing; think of them as pre-processed test data. The code reads binary data because the system can read it quickly. If you would rather not convert to binary, you can change the code to suit your needs, since it is openly available. Please do not ask what to change, as that part is up to you; if you run into an issue, ask here.
I opened the file in 'rb' mode, and it contains many unconverted characters.
Then I tried to process the files myself: I split the article and abstract and wrote them to separate files, but I got this error after processing most of the files:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
How can I get a clean article and abstract from these files?
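For what it's worth, the error itself is easy to explain: 0x80 is a UTF-8 continuation byte and can never start a character, so it usually means the bytes being decoded are not plain text (for example a length field or other binary framing). Below is a rough sketch, not the repository's actual reader. `find_first_bad_byte` only locates the offending byte. `read_records` assumes, purely for illustration, that each record is stored as an 8-byte little-endian length followed by that many payload bytes; this thread does not confirm that layout, and both function names are made up. The payload may itself be a serialized structure (TensorFlow data is often stored as `tf.train.Example` protocol buffers) that needs its own parser rather than `.decode("utf-8")`.

```python
import io
import struct


def find_first_bad_byte(data: bytes):
    """Return (offset, byte value) of the first byte that is not valid UTF-8,
    or None if the whole buffer decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


def read_records(stream):
    """Yield raw payloads from a length-prefixed binary stream.

    ASSUMPTION (hypothetical layout): each record is an 8-byte
    little-endian signed length followed by that many payload bytes.
    """
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of file
        if len(header) < 8:
            raise ValueError("truncated length header")
        (length,) = struct.unpack("<q", header)
        if length < 0:
            raise ValueError("negative record length")
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated record")
        yield payload


# Small self-made demo buffer with two records; not real dataset content.
demo = io.BytesIO()
for text in ("first article", "second article"):
    body = text.encode("utf-8")
    demo.write(struct.pack("<q", len(body)) + body)

records = list(read_records(io.BytesIO(demo.getvalue())))
bad = find_first_bad_byte(b"article=hello \x80\x03 abstract=hi")
```

If you only need to look at the text and can live with some damage, `data.decode("utf-8", errors="replace")` will not raise, but it swaps every undecodable byte for U+FFFD, so it hides the framing problem rather than fixing it.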