calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

Output unit number of Akita #81

Open frostinassiky opened 3 years ago

frostinassiky commented 3 years ago

Thanks for the amazing job!

According to the Akita tutorial, we need to modify the model parameters JSON so that the model has only two targets.

import json

params_file = './params.json'
with open(params_file) as params_open:
    params_tutorial = json.load(params_open)
# Set the final Hi-C head layer to two output units (two targets).
params_tutorial['model']['head_hic'][-1]['units'] = 2

ref: https://github.com/calico/basenji/blob/master/manuscripts/akita/tutorial.ipynb

However, the original parameters JSON has 5 targets. What are the extra 3 targets?

frostinassiky commented 3 years ago

BTW, does the learning curve look good for a production Akita training run? Early stopping happened at the 35th epoch.

[Image: akita_production]

davek44 commented 3 years ago

The tutorial chooses two of the datasets arbitrarily to demonstrate the code. The primary model that we studied in the paper was trained on the five target datasets described here: https://github.com/calico/basenji/blob/master/manuscripts/akita/data/targets.txt
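To keep the number of output units in sync with whichever targets you train on, something like the following could work. This is only a minimal sketch: it assumes the targets table is tab-separated with one header row, and `count_targets`, `set_output_units`, `example_targets`, and `example_params` are hypothetical names and stand-in data, not part of basenji.

```python
import copy
import csv
import io


def count_targets(targets_text):
    """Count data rows in a tab-separated targets table (header row assumed)."""
    reader = csv.reader(io.StringIO(targets_text), delimiter="\t")
    rows = [row for row in reader if row]  # skip blank lines
    return max(len(rows) - 1, 0)  # subtract the assumed header row


def set_output_units(params, num_targets):
    """Return a copy of params whose final head_hic layer has num_targets units."""
    updated = copy.deepcopy(params)
    updated["model"]["head_hic"][-1]["units"] = num_targets
    return updated


# Hypothetical two-row table standing in for a real targets file.
example_targets = "index\tidentifier\tfile\n0\tA\ta.cool\n1\tB\tb.cool\n"
# Hypothetical, heavily trimmed params dict; a real params.json has many more fields.
example_params = {"model": {"head_hic": [{"units": 5}]}}

tutorial_params = set_output_units(example_params, count_targets(example_targets))
print(tutorial_params["model"]["head_hic"][-1]["units"])
```

The copy leaves the original five-target parameters untouched, so the same loaded JSON can be reused for both the full model and a reduced one.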

Yes, those training curves look good. I'm guessing you're showing the training set statistics, since I don't think early stopping would have stopped there if those were the validation set statistics.
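For reference, early stopping is usually driven by the validation metric rather than the training metric. The snippet below is a generic patience-based sketch of that idea, not basenji's actual trainer; `early_stop_epoch`, `patience`, and `min_delta` are illustrative names, and the loss values are made up.

```python
def early_stop_epoch(val_losses, patience, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Validation loss improved: record it and reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# A steadily improving validation curve never triggers the stop.
print(early_stop_epoch([1.0, 0.9, 0.8, 0.7], patience=2))
# A curve that stalls after the second epoch stops once patience runs out.
print(early_stop_epoch([1.0, 0.9, 0.95, 0.96, 0.5], patience=2))
```

Under this scheme, a curve that is still improving at the stopping epoch is most likely the training curve, because the stop decision is made from the validation loss.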

frostinassiky commented 3 years ago

Hi @davek44, thanks for your response! Do you have direct links for the three datasets: GM12878, IMR90, and HCT116?

davek44 commented 3 years ago

If you want all of the datasets, consider using the version already preprocessed into TFRecords, which you can download with this script: https://github.com/calico/basenji/blob/master/manuscripts/akita/get_data.sh

If you want all of the cooler files, I added them to the cloud bucket here: https://console.cloud.google.com/storage/browser/basenji_hic/1m/data/coolers