calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

Clarifying data processing pipeline #103

Closed: adazhang1 closed this issue 3 years ago

adazhang1 commented 3 years ago

Hi @davek44

Thanks for making your code and data public!

I am trying to compare my model and data to the Basenji model and data, and have been digging through the .tfr files at https://console.cloud.google.com/storage/browser/basenji_barnyard/data.

The 2020 "Cross-species..." paper says that log fold change signal tracks were downloaded from ENCODE, high values were soft-clipped to 32, and negative values were clipped to zero. From my understanding, ENCODE only provides "fold change" not "log fold change" tracks - I therefore assumed that the ENCODE fold change tracks were soft clipped, then pushed through a log function, then negative values were clipped to zero.

The above-mentioned .tfr files look like soft-clipped fold change tracks scaled by 2. Looking through the basenji code, I haven't (yet) found any further data processing after import, i.e., no log transform and no clipping of negatives.

Could you help me understand: did you import the data and then take the log and clip negative values later in the pipeline? Or did you train directly on these soft-clipped fold change tracks? Or perhaps I have misunderstood something else?
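For reference, this is roughly how I have been peeking at the target values in one .tfr shard (a minimal sketch; the ZLIB compression, the 'sequence'/'target' feature keys, and the float16 target dtype are my reading of basenji's dataset.py, so please correct me if any of those assumptions are off):

```python
# Minimal sketch for inspecting target values in one basenji .tfr shard.
# Assumptions (my reading of basenji/dataset.py): records are ZLIB-compressed,
# and each example stores raw-byte 'sequence' (uint8 one-hot) and
# 'target' (float16) features.
import numpy as np
import tensorflow as tf

def peek_targets(tfr_path, num_examples=1):
    ds = tf.data.TFRecordDataset(tfr_path, compression_type='ZLIB')
    feature_spec = {
        'sequence': tf.io.FixedLenFeature([], tf.string),
        'target': tf.io.FixedLenFeature([], tf.string),
    }
    for raw in ds.take(num_examples):
        example = tf.io.parse_single_example(raw, feature_spec)
        targets = tf.io.decode_raw(example['target'], tf.float16).numpy()
        print('min %.3f  max %.3f  99th pct %.3f' %
              (targets.min(), targets.max(), np.percentile(targets, 99)))

# e.g. peek_targets('tfrecords/train-0.tfr')  # path is a placeholder
```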

Thanks so much for your help, I really appreciate your time!

Ada

davek44 commented 3 years ago

Hi Ada, perhaps I misunderstood what I had downloaded from ENCODE. I thought I was looking at log fold change; I remember seeing negative values when I initially browsed around. If the ENCODE files weren't logged, then I didn't add a log, so you could recreate the tracks by simply clipping the high end. I also scaled by 2 because it gave slightly better results, but it didn't matter much.
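Concretely, that would be something like the sketch below (the square-root soft clip here is just one way to write the compression of the high end; the data-read scripts in the repo have the exact function and threshold):

```python
# Minimal sketch of the preprocessing described above: no log transform,
# negatives clipped to zero, high values soft-clipped around 32, then the
# whole track scaled by 2. The square-root compression is one reasonable
# form of the soft clip, not necessarily the exact one used.
import numpy as np

def preprocess_track(fold_change, clip_soft=32.0, scale=2.0):
    x = np.clip(np.asarray(fold_change, dtype=np.float32), 0.0, None)  # clip negatives to zero
    high = x > clip_soft
    x[high] = clip_soft + np.sqrt(x[high] - clip_soft)                 # compress the high end
    return scale * x                                                   # scale by 2
```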

adazhang1 commented 3 years ago

Got it - thank you so much for your reply! (Also, I hope you have a great Thanksgiving holiday!)