m6anet-dataprep doesn't generate data.readcount.labelled file

NuriaDiaz commented 2 years ago

Dear developers,

when running m6anet-train I get the following error:

m6anet-train --model_config /home/diaz/anaconda3/envs/xpore2.1/lib/python3.10/site-packages/m6anet/model/configs/model_configs/prod_pooling.toml --train_config /home/diaz/anaconda3/envs/xpore2.1/lib/python3.10/site-packages/m6anet/model/configs/training_configs/oversampled.toml --save_dir ./m6anet_train_sham --device cpu --lr 0.0001 --seed 25 --epochs 30 --num_workers 32 --save_per_epoch 1 --num_iterations 5
Saving training information to ./m6anet_train_sham
Traceback (most recent call last):
  File "/home/diaz/anaconda3/envs/xpore2.1/bin/m6anet-train", line 8, in <module>
    sys.exit(main())
  File "/home/diaz/anaconda3/envs/xpore2.1/lib/python3.10/site-packages/m6anet/scripts/train.py", line 111, in main
    train_and_save(args)
  File "/home/diaz/anaconda3/envs/xpore2.1/lib/python3.10/site-packages/m6anet/scripts/train.py", line 69, in train_and_save
    train_dl, val_dl, test_dl = build_dataloader(train_config, num_workers)
  File "/home/diaz/anaconda3/envs/xpore2.1/lib/python3.10/site-packages/m6anet/utils/builder.py", line 43, in build_dataloader
    train_ds = build_dataset(train_config["dataset"], mode='Train')
  File "/home/diaz/anaconda3/envs/xpore2.1/lib/python3.10/site-packages/m6anet/utils/builder.py", line 24, in build_dataset
    return NanopolishDS(**config, mode=mode)
  File "/home/diaz/anaconda3/envs/xpore2.1/lib/python3.10/site-packages/m6anet/utils/data_utils.py", line 37, in __init__
    self.initialize_data_info(root_dir, min_reads)
  File "/home/diaz/anaconda3/envs/xpore2.1/lib/python3.10/site-packages/m6anet/utils/data_utils.py", line 92, in initialize_data_info
    read_count = pd.read_csv(os.path.join(fpath, "data.readcount.labelled"))
  File "/home/diaz/anaconda3/envs/xpore2.1/lib/python3.10/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/diaz/anaconda3/envs/xpore2.1/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/diaz/anaconda3/envs/xpore2.1/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/diaz/anaconda3/envs/xpore2.1/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/diaz/anaconda3/envs/xpore2.1/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1217, in _make_engine
    self.handles = get_handle(  # type: ignore[call-overload]
  File "/home/diaz/anaconda3/envs/xpore2.1/lib/python3.10/site-packages/pandas/io/common.py", line 789, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/nanopore/xpore_analysis/m6anet_dataprep_sham/data.readcount.labelled'

The content of the m6anet-dataprep output folder is: data.index data.json data.log data.readcount eventalign.index

So there isn't any file called data.readcount.labelled.

Could you help me understand why I am getting this error?

Thank you so much in advance, Núria

chrishendra93 commented 2 years ago

Hi @NuriaDiaz, the training script expects a file called data.readcount.labelled. We don't provide a way to generate it, since the labels are highly specific to each user. You can create your own data.readcount.labelled as long as it follows the format outlined on this page of the documentation:

https://m6anet.readthedocs.io/en/latest/training.html

Let me know if it works for you

Thank you

chrishendra93 commented 2 years ago

Hi @NuriaDiaz, are you able to run m6anet-train now?

NuriaDiaz commented 2 years ago

Hi @chrishendra93, I'm still not sure how to assign the labels to the positions in the data.readcount.labelled, so I can't run it. I can't really understand how to do that from the documentation provided.

chrishendra93 commented 2 years ago

Hi @NuriaDiaz, currently we don't have a standard way to do it, since this depends on the labels that each user has. The easiest way is to add another column called "modification_status" to your data.readcount file and save the result as data.readcount.labelled.
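As a rough illustration of that suggestion, here is a minimal sketch using only the Python standard library. It is not part of m6anet. The input column names (transcript_id, transcript_position, n_reads), the helper name label_readcount, and the toy rows are assumptions made for the example; check them against your own data.readcount and the training documentation. It also assumes you already have a set of known modified sites from an independent source, and that every other site should be labelled 0.

```python
import csv
import os
import tempfile


def label_readcount(readcount_path, labelled_path, modified_sites):
    """Copy a read-count CSV, appending a modification_status column.

    modified_sites is a set of (transcript_id, transcript_position) pairs
    that the user already knows to be modified. Labelling every other row
    as 0 is an assumption of this sketch.
    """
    with open(readcount_path, newline="") as src, \
            open(labelled_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames) + ["modification_status"]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            key = (row["transcript_id"], int(row["transcript_position"]))
            row["modification_status"] = 1 if key in modified_sites else 0
            writer.writerow(row)


# Toy demonstration with made-up rows (column names are assumed).
with tempfile.TemporaryDirectory() as tmp:
    readcount = os.path.join(tmp, "data.readcount")
    labelled = os.path.join(tmp, "data.readcount.labelled")
    with open(readcount, "w", newline="") as fh:
        fh.write("transcript_id,transcript_position,n_reads\n"
                 "tx1,100,25\n"
                 "tx1,250,40\n"
                 "tx2,75,31\n")

    # Hypothetical known modified site: position 250 on transcript tx1.
    label_readcount(readcount, labelled, {("tx1", 250)})

    with open(labelled, newline="") as fh:
        labelled_rows = list(csv.DictReader(fh))

for row in labelled_rows:
    print(row)
```

The labels themselves cannot be derived from the dataprep output; they have to come from whatever ground truth you have for your samples.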

NuriaDiaz commented 2 years ago

Hi @chrishendra93, sure, that makes total sense, but the problem is that I don't know how to fill in this extra column. Should I give random numbers to the "modification_status" of the positions? How do I know which label I have to write in "set_type"? In your example it looks like you need to know at least that some positions are modified. For that I'd need another tool, dataset, or technique that tells me which positions are indeed modified. Maybe I completely misunderstood this point in the documentation. Thanks!

LuckyMLucy commented 1 year ago

Dear developers, I ran the following code according to the Quick Start you gave. First, I tried to preprocess the segmented raw signal file, in the form of a nanopolish eventalign file, using 'm6anet-dataprep':

It printed some warnings but eventually succeeded, generating the five files it needed. But something went wrong when I tried to move on to the next step:

It looks like there is a problem loading the JSON file. Do you know how to solve it? Thanks so much, and I look forward to your reply.

chrishendra93 commented 1 year ago

Hi @LuckyMLucy, it seems that data.json was not generated properly, which results in incorrectly formatted JSON output. Could you try re-running it to see if the problem persists? I've seen this error in the past but cannot reproduce it reliably, so let me check on this again from my end.

chrishendra93 commented 1 year ago

@NuriaDiaz pardon the late reply, I missed your earlier message and was only notified when someone posted a new comment here. The training script expects you to know which positions are modified, since it is meant for re-training m6Anet, either to predict m6A modifications in species to which the current model does not generalize or to predict other modifications. The set_type should be assigned manually, since it depends on the user's decision about the training, validation, and testing split.
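One possible way to fill in set_type is sketched below; this is an illustration, not the m6anet authors' procedure. It assigns each transcript to a split by hashing its ID, so the assignment is reproducible and all sites of a transcript land in the same split. The value 'Train' matches the mode shown in the traceback above; 'Val' and 'Test', the 80/10/10 proportions, the function name assign_set_type, and the toy rows are assumptions.

```python
import hashlib


def assign_set_type(transcript_id, train=0.8, val=0.1):
    """Map a transcript ID to 'Train', 'Val', or 'Test' deterministically.

    The ID is hashed to a number in [0, 1) and compared against the
    requested proportions. The split names other than 'Train' and the
    default proportions are assumptions of this sketch.
    """
    digest = hashlib.md5(transcript_id.encode("utf-8")).hexdigest()
    fraction = int(digest[:8], 16) / 0x100000000
    if fraction < train:
        return "Train"
    if fraction < train + val:
        return "Val"
    return "Test"


# Toy rows standing in for an already labelled read-count table.
rows = [
    {"transcript_id": "tx1", "transcript_position": 100, "modification_status": 0},
    {"transcript_id": "tx1", "transcript_position": 250, "modification_status": 1},
    {"transcript_id": "tx2", "transcript_position": 75, "modification_status": 0},
]

for row in rows:
    row["set_type"] = assign_set_type(row["transcript_id"])
    print(row)
```

Splitting by transcript (or by gene) rather than by individual site is a design choice that keeps closely related sites out of different splits; whether that is appropriate depends on your experimental design.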

LuckyMLucy commented 1 year ago

@chrishendra93 I tried again. I used m6anet-dataprep to process the data like this; the only thing I changed is the input and output paths:

Then I tried to go ahead, but the same problem occurred. I tried to load the data.json file separately like this, but it failed:

But when I load it line by line, it works:

And this is the data.json, right?

I don't know what's wrong with it yet.

chrishendra93 commented 1 year ago

Hi @LuckyMLucy, I could not reproduce the error with m6anet-run_inference. The error that you saw when you did json.loads(f) occurs because data.json as a whole is not a valid JSON file. It is only meant to be read line by line, since the inference process is done in mini-batches. m6anet-run_inference loads data.json line by line, calling json.loads(line) each time.
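To illustrate the difference, here is a small self-contained sketch. It is not the m6anet inference code; the helper name iter_json_lines and the toy records are made up for the example. It writes a file with one JSON object per line, shows that parsing the whole file at once fails, and then parses it line by line.

```python
import json
import os
import tempfile


def iter_json_lines(path):
    """Yield one parsed object per non-empty line of a line-delimited JSON file."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")
    # Toy content: two independent JSON objects, one per line.
    with open(path, "w") as fh:
        fh.write('{"tx1": {"100": {"AAACT": [[0.1, 2.0, 110.5]]}}}\n')
        fh.write('{"tx2": {"75": {"GGACA": [[0.2, 3.1, 120.0]]}}}\n')

    # Parsing the whole file as a single document fails, because two
    # top-level objects in a row are not one valid JSON value.
    try:
        with open(path) as fh:
            json.loads(fh.read())
        whole_file_ok = True
    except json.JSONDecodeError:
        whole_file_ok = False

    # Parsing each line separately works.
    records = list(iter_json_lines(path))

print("whole file parsed:", whole_file_ok)
print("records parsed line by line:", len(records))
```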

Can I clarify with you on the installation procedure that you went through when installing m6anet? Did you happen to modify the inference file?

LuckyMLucy commented 1 year ago

@chrishendra93 Thanks for your quick reply! I'm sorry to keep bothering you, but I really can't fix this. I deleted all the files and started over. This is the whole process. First, I cloned m6anet from GitHub into my virtual environment:

Then I ran the quick start, just as I showed above. I changed nothing, so this really confuses me.

chrishendra93 commented 1 year ago

Hi @LuckyMLucy, this is indeed quite confusing. I have just merged the latest development branch. Try deleting your virtualenv and re-creating it with Python 3.8, then re-clone the repo (setup.py should have version 1.1.2 in there) and re-install it. Let me know how it goes.

chrishendra93 commented 1 year ago

Closing this issue since there is no more activity

GoekeLab / m6anet, issue #41