MS20190155 / Measuring-Corporate-Culture-Using-Machine-Learning

Code Repository for MS20190155

Changing Dataset #4

Closed camelot2002 closed 3 years ago

camelot2002 commented 3 years ago

I wanted to change the data set but am unable to understand how you have mapped document_ids to the documents. A little clarification of that in readme.md would be really helpful. Thank you.

maifeng commented 3 years ago

The document ids are either unique IDs provided by the data vendor or they can be incremental IDs. If you have a CSV file with no other unique identifiers, you can save the row numbers as the document IDs.
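As a rough illustration of the row-number approach, here is a minimal sketch using only the standard library. The column names (`ticker`, `text`), the sample rows, and the helper `row_number_ids` are hypothetical and not taken from the repository; the choice of 1-based numbering is also an assumption, since any scheme works as long as each ID is unique.

```python
import csv
import io


def row_number_ids(csv_text, text_column):
    """Return (ids, texts), using 1-based row numbers as document IDs.

    Hypothetical helper for illustration; not part of the repository.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    ids, texts = [], []
    for row_number, row in enumerate(reader, start=1):
        ids.append(str(row_number))      # incremental ID for this row
        texts.append(row[text_column])   # the document text for this row
    return ids, texts


# Made-up sample data with an assumed "text" column.
sample = "ticker,text\nAAA,First call transcript.\nBBB,Second call transcript.\n"
ids, texts = row_number_ids(sample, "text")
print(list(zip(ids, texts)))
```

If the data vendor already supplies a unique identifier, that column could be used in place of the row number.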

camelot2002 commented 3 years ago

I don't have a CSV file; all I have is the data.

camelot2002 commented 3 years ago

I have a ticker to differentiate the companies. But in your CSV files one document has multiple document IDs, and I don't understand how a document has been broken down.

maifeng commented 3 years ago

One input document corresponds to one unique ID. The document file has the same number of rows as the document-ID file.

camelot2002 commented 3 years ago

The document.txt in the input folder contains several documents, right? Each line has a unique ID, and each document also has a unique ID. How does it differentiate between the different documents in all that text?

maifeng commented 3 years ago

Each line in document.txt is a unique document with line breaks removed.
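For anyone starting from raw text rather than a CSV, the layout described in this thread could be produced along these lines. This is a sketch under assumptions: the file names `documents.txt` and `document_ids.txt`, the ticker-based IDs, and the helpers `flatten`, `write_corpus`, and `read_pairs` are illustrative, so check them against the repository before relying on them.

```python
import os
import tempfile


def flatten(text):
    # Collapse every run of whitespace, including line breaks, to one space,
    # so that each document occupies exactly one line.
    return " ".join(text.split())


def write_corpus(docs, out_dir, doc_file="documents.txt", id_file="document_ids.txt"):
    """Write one document per line and its ID on the same line number.

    docs maps a unique ID to raw text. File names are assumptions.
    """
    doc_path = os.path.join(out_dir, doc_file)
    id_path = os.path.join(out_dir, id_file)
    with open(doc_path, "w", encoding="utf-8") as d, \
            open(id_path, "w", encoding="utf-8") as i:
        for doc_id, text in docs.items():
            d.write(flatten(text) + "\n")
            i.write(str(doc_id) + "\n")
    return doc_path, id_path


def read_pairs(doc_path, id_path):
    """Pair line n of the ID file with line n of the document file."""
    with open(doc_path, encoding="utf-8") as d:
        documents = d.read().splitlines()
    with open(id_path, encoding="utf-8") as i:
        ids = i.read().splitlines()
    if len(documents) != len(ids):
        raise ValueError("document file and ID file must have the same number of rows")
    return list(zip(ids, documents))


# Made-up documents keyed by hypothetical ticker-based IDs.
raw_docs = {
    "AAA_2019Q1": "We value\ninnovation.\n\nTeamwork matters.",
    "BBB_2019Q1": "Integrity\r\nfirst.",
}
out_dir = tempfile.mkdtemp()
pairs = read_pairs(*write_corpus(raw_docs, out_dir))
print(pairs)
```

Because the two files are matched purely by line number, a stray line break inside a document would shift every later ID, which is why the line breaks are removed before writing.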

camelot2002 commented 3 years ago

Okay, thank you.