daenuprobst / molzip

The gzip classification method implemented for molecule classification.
MIT License

TODO (For ME) #11

Open PowersPope opened 11 months ago

PowersPope commented 11 months ago

Apply the same regression and/or classification approaches as currently implemented, but try all possible gzip compression levels. My intuition says that varying levels of compression will add different amounts of contextual information. This could possibly get us better performance in the beginning.
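A minimal sketch of the proposed sweep, assuming a gzip-NCD k-NN classifier like the one this repo is based on. The `ncd`/`knn_predict` names and the tiny SMILES list are illustrative, not the repo's actual API:

```python
import gzip
from collections import Counter

def ncd(a: str, b: str, level: int) -> float:
    """Normalized compression distance between two strings at a given gzip level."""
    ca = len(gzip.compress(a.encode("utf-8"), compresslevel=level))
    cb = len(gzip.compress(b.encode("utf-8"), compresslevel=level))
    cab = len(gzip.compress((a + b).encode("utf-8"), compresslevel=level))
    return (cab - min(ca, cb)) / max(ca, cb)

def knn_predict(query: str, train: list, k: int, level: int) -> str:
    """Predict by majority vote over the k nearest training SMILES under NCD."""
    nearest = sorted(train, key=lambda item: ncd(query, item[0], level))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Illustrative sweep over all gzip compression levels (0 = stored, 9 = max).
train = [("CCO", "alcohol"), ("CCCO", "alcohol"), ("c1ccccc1", "aromatic")]
for level in range(10):
    print(level, knn_predict("CCCCO", train, k=1, level=level))
```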

PowersPope commented 11 months ago

Update on the gzip compression-levels test I ran: I got some interesting results. I varied the compression level from 0 to 9 (with 0 being no compression and 9 being the highest compression). Even at level 0, I was still encoding the string in UTF-8.

If you look at the graphs below for both the valid and test datasets, compression level 0 (UTF-8 encoding only) did better in most places, though MUV seems to improve with compression (at least on the valid set).

Anybody have any ideas/thoughts on why this is happening?

[Figure: RMSE_TEST_CLASSIFICATION (test set)]

[Figure: RMSE_VALID_CLASSIFICATION (valid set)]

janweinreich commented 11 months ago

This is very interesting, thanks for testing this! I am certainly not an expert on compression algorithms, but from what I understand:

If the compression level is increased, gzip spends more time searching for repeated patterns in the data. For instance, consider the string "ABABABA". This can be compressed as "ABA(Back 2, Length 4)", where "(Back 2, Length 4)" is a reference that copies 4 characters starting 2 characters back (the copy may overlap the text it produces).
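The effect of these back-references is easy to see by compressing a highly repetitive string; a quick sanity check:

```python
import gzip

# A highly repetitive string compresses far below its raw length because
# gzip (DEFLATE) replaces repeats with back-references.
repetitive = b"AB" * 1000          # 2000 bytes of "ABAB..."
compressed = gzip.compress(repetitive, compresslevel=9)
print(len(repetitive), len(compressed))  # compressed size is a small fraction
```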

It is unfortunate that level 0 seems to perform best, though! Possibly the higher-level patterns don't add much additional information for ML. Did you check whether the file size actually still decreases with increasing compression?
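One quick way to check this is a size sweep over all levels; a sketch, where the SMILES list is a stand-in rather than the benchmark data:

```python
import gzip

# Does the compressed size keep shrinking as the gzip level rises?
smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"] * 50
blob = "\n".join(smiles).encode("utf-8")

sizes = {level: len(gzip.compress(blob, compresslevel=level)) for level in range(10)}
for level, size in sizes.items():
    print(f"level {level}: {size} bytes")
```

Level 0 stores the data uncompressed (plus gzip framing overhead), so its output is slightly larger than the input; if sizes plateau well before level 9, that would support the "nothing further to compress" explanation.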

If you set the compression level to 9 but nothing further can be compressed, then I'd expect the distances between points we get from gzip to be similar, and therefore also the error on the test set.

PowersPope commented 11 months ago

Thanks for the overview of how Gzip compression works. I didn't know the specifics.

I think that explains a little of what is happening with gzip and SMILES. I imagine there are a lot of repeating patterns, since SMILES strings are mostly made up of only a handful of characters, in particular (N, C, H, O). My guess is that the compressed lengths are relatively similar across the board. That seems easy enough to test by creating a length distribution for the compressed and uncompressed strings.
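A sketch of that length-distribution test, using a few placeholder molecules rather than the actual dataset:

```python
import gzip
import statistics

# Compare the distribution of raw vs gzip-compressed SMILES lengths.
smiles = ["CCO", "CC(=O)Oc1ccccc1C(=O)O", "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
          "c1ccccc1", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"]

raw_lengths = [len(s) for s in smiles]
gz_lengths = [len(gzip.compress(s.encode("utf-8"), compresslevel=9)) for s in smiles]

print("raw  mean/stdev:", statistics.mean(raw_lengths), statistics.stdev(raw_lengths))
print("gzip mean/stdev:", statistics.mean(gz_lengths), statistics.stdev(gz_lengths))
```

Note that for strings this short the fixed gzip header/trailer overhead (about 18 bytes) dominates, which by itself pushes compressed lengths toward being "relatively similar across the board".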

I didn't check the file size. I edited some functions and created an additional one so that I could run a bash script that sweeps the input compression level from 0 to 9. You can pass the compresslevel argument as gzip.compress(foo, compresslevel=n) with n in 0-9. Since gzip_classifier was already using gzip.compress, I just implemented it that way.
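The setup described might look something like this hypothetical command-line wrapper (the flag name and sample string are assumptions, not the actual script):

```python
import argparse
import gzip

def main(argv=None) -> int:
    """Compress a sample at the level given by --compresslevel and return the size."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--compresslevel", type=int, choices=range(10), default=9)
    args = parser.parse_args(argv)

    sample = ("CC(=O)Oc1ccccc1C(=O)O\n" * 100).encode("utf-8")
    size = len(gzip.compress(sample, compresslevel=args.compresslevel))
    print(f"level={args.compresslevel} compressed_size={size}")
    return size

main(["--compresslevel", "6"])
```

A bash loop such as `for lvl in $(seq 0 9); do python script.py --compresslevel $lvl; done` would then drive the sweep.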

I was also curious about looking into a lossy compression algorithm. My intuition tells me the compressed sequences would be diverse, since features are reduced/averaged out. Though I don't know.