
Multinomial Adversarial Networks for Multi-Domain Text Classification (NAACL 2018)
https://arxiv.org/abs/1802.05694
MIT License

Missing Code for Generating Pickle Files #1

Closed: Humanity123 closed this 6 years ago

Humanity123 commented 6 years ago

Can you provide a description of the data present in the pickle files for the Multi-Domain Amazon Dataset, or the code for generating these pickle files?

ccsasuke commented 6 years ago

We took the preprocessed version of the Amazon dataset from the mSDA paper (https://www.cse.wustl.edu/~mchen/code/mSDA.tar) that comes in a Matlab format (.mat).

I guess I further converted that into a pickle file and did some reorganization of the data, such as splitting it into "labeled" and "unlabeled" sets. Sorry, I can no longer locate the script I used for this conversion, but I can briefly describe the structure of the pkl file. The relevant code for consuming the pkl file can be found in data_prep/msda_preprocessed_amazon_dataset.py.

The pickled file contains a single dictionary with two keys, "labeled" and "unlabeled". Each of the two items is a tuple (x, y), where x is a 2d numpy array containing the input features and y is a 1d numpy array with the labels. You can also modify msda_preprocessed_amazon_dataset.py to work with other data formats.
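For reference, a minimal sketch of what such a conversion script might look like is below. Only the output format (a dict mapping "labeled"/"unlabeled" to (x, y) tuples) is confirmed in this thread; the .mat variable names (`xx`, `yy`) and the split size are assumptions:

```python
# Hypothetical reconstruction of the .mat -> .pkl conversion.
# The .mat variable names (xx, yy) and the labeled/unlabeled split size
# are assumptions; only the pickled dict format is described in this thread.
import pickle

import numpy as np
from scipy.io import loadmat

mat = loadmat('amazon.mat')             # from the mSDA release
x = np.asarray(mat['xx'].todense()).T   # assumed (features, examples) sparse matrix
y = mat['yy'].ravel()                   # assumed label vector

n_labeled = 2000                        # assumed split size
data = {
    'labeled': (x[:n_labeled], y[:n_labeled]),
    'unlabeled': (x[n_labeled:], y[n_labeled:]),
}
with open('amazon.pkl', 'wb') as f:
    pickle.dump(data, f)
```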

Humanity123 commented 6 years ago

Hey,

I looked at the pickle files. The matrix has shape (number of data points, 30000). I wanted to test your implementation on a different Amazon dataset. I believe the 30000-dimensional feature vector was constructed using unigram and bigram features. Can you explain the process of generating the feature vector of size 30000?


ccsasuke commented 6 years ago

Hi,

As I mentioned, the 30k-dimensional feature matrix is taken from the mSDA paper by Chen et al. (2012), so I cannot speak for them about how exactly they processed the dataset. However, I believe they simply took the 30000 most frequent features (and we only use the top 5k to be consistent with some earlier work we compared against as baselines).
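If it helps, a common way to build this kind of representation is to count unigrams and bigrams and keep the 30000 most frequent terms. The sketch below does this with scikit-learn's CountVectorizer; it illustrates the general recipe, not necessarily the exact preprocessing pipeline Chen et al. used, and "top 5k" is assumed to mean the 5000 most frequent features:

```python
# Illustrative sketch only: builds a unigram+bigram count matrix keeping the
# most frequent terms. This mirrors the general recipe described above, not
# necessarily the exact preprocessing from the mSDA paper.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great product, works well", "terrible, broke after a day"]  # your reviews

vectorizer = CountVectorizer(
    ngram_range=(1, 2),   # unigrams and bigrams
    max_features=30000,   # keep at most the 30000 most frequent terms
)
x = vectorizer.fit_transform(docs)   # sparse matrix, up to 30000 columns

# To match this repo's setting, keep only the top 5000 features. Note that
# CountVectorizer orders its columns alphabetically, so re-rank by corpus
# frequency first (assuming "top" means most frequent).
counts = np.asarray(x.sum(axis=0)).ravel()
top5k = np.argsort(counts)[::-1][:5000]
x_5k = x[:, top5k]
```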