Modification of n_read_per_site

DayeaPark commented 1 year ago

Hi.

I am currently running m6Anet with the pre-trained model (Hct116_RNA002). Is there any way to change n_reads_per_site from 20 to 10? I changed n_reads_per_site in the model TOML file (prod_pooling.toml), but I get the error below. Any help with modifying the read threshold would be appreciated. Thank you.

ValueError: Length of values (86428) does not match length of index (43214)

My modified TOML file looks like this:

model = "prod_sigmoid_pooling"

[[block]]
block_type = "DeaggregateNanopolish"
num_neighboring_features = 1

[[block]]
block_type = "KmerMultipleEmbedding"
input_channel = 66
output_channel = 2
num_neighboring_features = 1

[[block]]
block_type = "ConcatenateFeatures"

[[block]]
block_type = "Linear"
input_channel = 15
output_channel = 150
activation = "relu"
batch_norm = true

[[block]]
block_type = "Linear"
input_channel = 150
output_channel = 32
activation = "relu"
batch_norm = false

[[block]]
block_type = "SigmoidProdPooling"
input_channel = 32
n_reads_per_site = 10

kristinrma commented 1 year ago

Hi @DayeaPark, both the min_reads variable in the training config file and the n_reads_per_site variable in the model config need to be changed. Sorry for the confusion in the documentation. Let us know if that solves your issue.

DayeaPark commented 1 year ago

Thanks for the response, @kristinrma. I installed m6Anet in a conda environment. Could you let me know where I can find both the training config and the model config? I created a model config and provided it when running m6Anet inference, but I cannot find where to change the training config.

I guess I could train the model again by providing a training config to m6Anet-train, but I have a problem at that step. In the training config, I need to edit root_dir and norm_path. Since I am using your preset data, I used your norm_path file (norm_factors_hct116.joblib), but I cannot find the labelled file for root_dir. According to your description, root_dir must contain a data.info.labelled file and a data.json file. Could you let me know where I can find those files?

I just want to use your pre-trained model with the n_read threshold set to 1. If you have any idea how to solve this without re-training the model, please let me know. Thank you.

My training config looks like this:

[loss_function]
loss_function_type = "binary_cross_entropy_loss"

[dataset]
root_dir = "/path/to/m6anet/m6anet/tests/data/"
min_reads = 10
norm_path = "/path/to/m6anet/m6anet/model/norm_factors/norm_factors_hct116.joblib"
num_neighboring_features = 1

[dataloader]

[dataloader.train]
batch_size = 256
sampler = "ImbalanceOverSampler"

[dataloader.val]
batch_size = 256
shuffle = false

[dataloader.test]
batch_size = 256
shuffle = false

Thanks for your help!

kristinrma commented 1 year ago

Hi @DayeaPark,

Glad you were able to find the sample training config and modify it. To replicate the pre-trained model with the minimum number of reads set to 10, I would suggest running m6Anet dataprep on the SGNex Hct116 Rep2 Run1 dataset, as this was the original dataset used to train m6Anet. A tutorial on how to retrieve files from the SGNex AWS S3 bucket can be found at https://github.com/GoekeLab/sg-nex-data/blob/master/docs/AWS_data_access_tutorial.md. After you generate the dataprep files, you can follow the m6Anet training documentation to create data.info.labelled from your data.info file, then set root_dir to your dataprep folder. Hope this helps.
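Before launching training, it may help to confirm that root_dir actually contains the two files mentioned earlier in this thread (data.info.labelled and data.json). The sketch below is a plain filesystem check and does not call m6Anet. The helper name `missing_training_files` is hypothetical.

```python
from pathlib import Path

# File names taken from this thread; training expects both inside root_dir.
REQUIRED_FILES = ("data.info.labelled", "data.json")


def missing_training_files(root_dir):
    """Return the required file names that are not present in root_dir."""
    root = Path(root_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means both files were found. Any names returned point to a step that still needs to be run, for example creating data.info.labelled from data.info.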

GoekeLab / m6anet

Modification of n_read_per_site #131