abisee / cnn-dailymail

Code to obtain the CNN / Daily Mail dataset (non-anonymized) for summarization
MIT License

What are these characters in the bin file? #20

Open JunjieCheng opened 6 years ago

JunjieCheng commented 6 years ago

I opened the file in 'rb' mode, and it contains many bytes that do not decode as text:

with open('/users/cheng/NLP/Data/finished_files/chunked/test_000.bin', 'rb') as file:
    for line in file:
        print(line)
b'R\x1e\x00\x00\x00\x00\x00\x00\n'
b'\xcf<\n'
b'\xf0\x02\n'
b'\x08abstract\x12\xe3\x02\n'
b'\xe0\x02\n'
b"\xdd\x02<s> marseille prosecutor says `` so far no videos were used in the crash investigation '' despite media reports . </s> <s> journalists at bild and paris match are `` very confident '' the video clip is real , an editor says . </s> <s> andreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says . </s>\n"
b'\xd99\n'
b'\x07article\x12\xcd9\n'
b'\xca9\n'

Then I tried to process the files myself by splitting the article and abstract and writing them to separate files, but after processing most of the files I got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

How can I get a clean article and abstract from these files?

JafferWilson commented 6 years ago

@JunjieCheng these files are in the binary format that the TensorFlow code reads. The data has already been preprocessed for testing, and the code accepts binary input because it is fast for the system to read. If you do not want binary output, you can change the code to suit your needs, since it is openly available. Please do not ask what to change, as that is for you to work out; if you run into an issue, ask here.
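
For anyone who only wants the text back out, here is a rough sketch of a reader. It assumes each record in the .bin file is an 8-byte length followed by that many bytes of a serialized tf.Example with "article" and "abstract" byte features. That matches the dump above (the first 8 bytes, R\x1e\x00\x00\x00\x00\x00\x00, read as 7762 in little-endian), but check it against the script that produced your files. The names iter_records and parse_example are made up for this example, and the little-endian "<q" format is an assumption about the machine that wrote the data. Iterating the file line by line splits the binary wherever a 0x0a byte happens to occur, and the length prefixes are not UTF-8 text, which would explain the UnicodeDecodeError.

```python
import io
import struct


def iter_records(stream):
    """Yield the raw payload of each length-prefixed record in a binary stream.

    Assumed layout: an 8-byte little-endian signed length, then that many bytes.
    """
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of file
        if len(header) < 8:
            raise ValueError("truncated length header")
        (length,) = struct.unpack("<q", header)
        if length < 0:
            raise ValueError("negative record length")
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record")
        yield payload


def parse_example(record):
    """Decode one record into (article, abstract) strings.

    Requires TensorFlow; the feature names are assumed from the dump above.
    """
    from tensorflow.core.example import example_pb2  # imported lazily

    example = example_pb2.Example.FromString(record)
    features = example.features.feature
    article = features["article"].bytes_list.value[0].decode("utf-8")
    abstract = features["abstract"].bytes_list.value[0].decode("utf-8")
    return article, abstract


# Small in-memory check of the framing logic only (no TensorFlow needed).
demo = struct.pack("<q", 5) + b"hello" + struct.pack("<q", 2) + b"hi"
demo_records = list(iter_records(io.BytesIO(demo)))
```

To use it on a real file, open the file with 'rb', pass the file object to iter_records, and call parse_example on each payload. The abstract still contains the <s> and </s> sentence markers shown in the dump, so strip those if you want plain sentences.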