HubertTang / PLASMe

I was wondering if you could provide the reference chromosome dataset used for training this model? #6

Open ZoeLct opened 10 months ago

ZoeLct commented 10 months ago

Hi, I was wondering if you could provide the reference chromosome dataset used for training this model? Also, could you clarify what the labels for the .tsv files related to these chromosomes should be? Are they labeled as chromosomes, various types of bacteria, or non-plasmids? I'm new to this field and not very familiar with the related knowledge, so I hope to get your answer!

HubertTang commented 10 months ago

Hi ZoeLct,

I have uploaded all the reference chromosomes to the supplementary data folder (OneDrive, rep_chrom_comp.fna). Please take a look. I labeled them as chromosomes.

Best, Xubo

ZoeLct commented 9 months ago

Thank you so much for your previous response. I have a question about training the model. I only see the requirements for inputting training set data and validation set data, but I don't see any requirements for inputting plasmid and chromosome labels. For example, train_pos_path = f"path/to/pos.fna" and train_pos_data_dir = f"path/to/pos". Should I put the label files in this directory for the training and testing sets? Is there a problem with my understanding? Looking forward to hearing from you.

HubertTang commented 9 months ago

Hi ZoeLct,

The example training script demonstrates a binary classification task in which "pos" denotes positive samples and "neg" denotes negative samples, so there is no need to configure additional label files. Simply place the positive sample (plasmid) sequences in 'path/to/pos.fna' and the negative sample (chromosome) sequences in 'path/to/neg.fna'. The train_pos_data_dir is the folder where the script writes the data it generates; you only need to set its path in the script.

Best, Xubo
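
To illustrate the point above that labels follow from which file a sequence is in, here is a minimal Python sketch. It is not code from PLASMe: the helper names `read_fasta_ids` and `implicit_labels` are hypothetical, and the 1/0 encoding for plasmid/chromosome is an assumption chosen for illustration.

```python
def read_fasta_ids(path):
    """Return the record IDs (first token of each header line) in a FASTA file."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                ids.append(tokens[0] if tokens else "")
    return ids


def implicit_labels(pos_fasta, neg_fasta):
    """Map each record ID to a label derived only from its source file.

    Assumed encoding (illustrative): 1 = positive (plasmid),
    0 = negative (chromosome). No separate label file is read.
    """
    labels = {}
    for seq_id in read_fasta_ids(pos_fasta):
        labels[seq_id] = 1
    for seq_id in read_fasta_ids(neg_fasta):
        labels[seq_id] = 0
    return labels
```

Calling `implicit_labels("path/to/pos.fna", "path/to/neg.fna")` would return a dictionary keyed by sequence ID, which shows why a separate .tsv of labels is unnecessary for the binary case.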

HubertTang commented 9 months ago

BTW, the reference chromosomes and plasmids that I have uploaded are complete sequences. As mentioned in the paper, it is necessary to downsample them into shorter fragments for the training data to achieve better model performance. You can follow the sampling method described in the paper: we sample the sequences using sliding windows of 200, 400, 600, 800, 1000, 2000, and 4000 bp, with a stride of half the window size. Alternatively, you can randomly sample fragments directly from the sequences using a normal distribution.
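
The sliding-window scheme described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the script used for the paper: the function names and the fragment ID format are hypothetical, and a trailing piece shorter than the window is simply dropped, which the thread does not specify.

```python
# Window sizes (bp) quoted in the thread; the stride is half the window size.
WINDOW_SIZES = (200, 400, 600, 800, 1000, 2000, 4000)


def sliding_windows(seq, window, stride=None):
    """Yield (start, fragment) pairs of exactly `window` bases.

    Assumption: a trailing piece shorter than `window` is discarded,
    and sequences shorter than `window` yield nothing.
    """
    if stride is None:
        stride = window // 2
    for start in range(0, len(seq) - window + 1, stride):
        yield start, seq[start:start + window]


def fragment_record(seq_id, seq, window_sizes=WINDOW_SIZES):
    """Yield (fragment_id, fragment) for every window size in turn.

    The ID format `<id>_w<window>_s<start>` is a made-up convention
    so that each fragment can be traced back to its source record.
    """
    for window in window_sizes:
        for start, fragment in sliding_windows(seq, window):
            yield f"{seq_id}_w{window}_s{start}", fragment
```

For example, `list(fragment_record("chr1", sequence))` collects fragments at all seven window sizes from one complete sequence; fragments from plasmids would then be written to the positive FASTA file and fragments from chromosomes to the negative one.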