sdontsay commented 9 months ago

Hello,

I came across your great paper, and I am trying to run the code on my machine. However, it seems I can only choose one of two cell lines ('Human' or 'Dros') for training. Could you confirm whether this code supports training on custom scHi-C datasets?

Thanks, Sdontsay

yw7bh commented 9 months ago

The architecture itself can be applied to any scHi-C dataset; the question is whether you can prepare your dataset to fit the model. The dataset-processing method can also be applied to other scHi-C datasets, but it needs some small changes, such as the file paths and file types.

sdontsay commented 9 months ago

Thanks for your reply. That sounds good. My follow-up question is about the README file, where you stated:

First step enter the following folder

cd ./TrainingYourData

Second step check whether the environment is active, if not, activate the environment

conda activate ScHiCEDRN.yml

Third step run the training scripts

python ScHiCEDRN_train.py -g [boolean_value] -e [epoch_number] -b [batch_size] -n [cell_number] -l [cell_line] -p [percentage]

The command above has a "-l" option, which you said needs to be one of [human, Dros]. I understand that I would choose "human" if my custom dataset is from human cells, but what if my dataset is from mouse cells? How should I set the option?

Thanks

yw7bh commented 9 months ago

This is not a matter of the option's range. The "-l" option corresponds to a parameter in the Python scripts for dataset processing, so you should customize the dataset-processing script. You can make some small changes to that part.
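
One way to read the reply above: the value passed to "-l" is simply forwarded to the dataset-processing code, so supporting a new organism means adding a processing class rather than widening the option's range. The following is a minimal sketch of that idea, assuming a registry pattern. `DATA_MODULES`, `HumanModule`, `MouseModule`, the long option names, and the default values are all hypothetical and are not taken from the repository; only the short flags come from the README.

```python
import argparse


class HumanModule:
    """Placeholder standing in for an existing data-processing class."""

    def __init__(self, cell_number, percentage):
        self.cell_number = cell_number
        self.percentage = percentage


class MouseModule(HumanModule):
    """Hypothetical class a user would write for a mouse dataset."""


# Hypothetical registry: cell-line name -> processing class.
DATA_MODULES = {"Human": HumanModule, "Mouse": MouseModule}


def build_parser():
    """Build a parser with a subset of the README's short flags."""
    parser = argparse.ArgumentParser(description="Training options (sketch)")
    parser.add_argument("-n", "--celln", type=int, default=1)
    # The allowed values are whatever the registry knows about.
    parser.add_argument("-l", "--celline", default="Human",
                        choices=sorted(DATA_MODULES))
    parser.add_argument("-p", "--percent", type=float, default=0.75)
    return parser


def make_data_module(argv):
    """Parse argv and instantiate the processing class selected by -l."""
    args = build_parser().parse_args(argv)
    module_cls = DATA_MODULES[args.celline]
    return module_cls(cell_number=args.celln, percentage=args.percent)
```

With this layout, `make_data_module(["-n", "33", "-l", "Mouse", "-p", "0.75"])` would return a `MouseModule`, and registering another dataset would only require one more entry in the dictionary.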

sdontsay commented 9 months ago

Sorry, I don't quite get it. Do you mean I should make changes directly to your original scripts, such as the one named "PrepareData_tensorM.py"?

For example, say I downloaded this dataset (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE119171), which is mouse cell data. Are you saying I should write a new class similar to "class GSE130711Module(pl.LightningDataModule)" to make the script runnable? And for any other dataset not covered by your scripts, would I need to do this again?

Thanks

yw7bh commented 9 months ago

Yes, you need to do that. According to your link, your dataset is not in the ".cool" format, so you need to extract the data content yourself, and "GSE130711Module" should be modified to fit your dataset. In general, we only adopted the model architectures from other work; the dataset-processing part was created by ourselves. Different researchers have specific needs when processing their datasets, so the dataset processing varies, but the networks/architectures are always the same or similar.

sdontsay commented 9 months ago

Thanks. I have another question about running the main script.

I put 33 .mcool files in the Datasets/Human folder and ran the following command:

python ScHiCEDRN_train.py -g 1 -e 60 -b 64 -n 33 -l 'Human' -p 0.75

I got the following error:

/home/.../sdontalk/lib/python3.9/site-packages/iced/normalization/_ca_utils.py:8: UserWarning: The API of this module is likely to change. Use only for testing purposes
  warnings.warn(
Preparing the Preparations ...
.. wait, first we need to split the mats
wait.. first we need to extract mats and double check the mats
Traceback (most recent call last):
  File "/home/.../ScHiCEDRN/TrainingYourData/ScHiCEDRN_train.py", line 23, in <module>
    train_model = hiedsr(Gan=True, epoch=args.epoch, batch_s=args.batch_size, cellN=args.celln, celline=args.celline, percentage=args.percent)
  File "/home/.../ScHiCEDRN/./Pretrain/train_schicedrn_vH.py", line 69, in __init__
    DataModule.prepare_data()
  File "/home/.../ScHiCEDRN/./ProcessData/PrepareData_tensorH.py", line 282, in prepare_data
    self.split_numpy()
  File "/home/.../ScHiCEDRN/./ProcessData/PrepareData_tensorH.py", line 259, in split_numpy
    self.extract_create_numpy( )
  File "/home/.../ScHiCEDRN/./ProcessData/PrepareData_tensorH.py", line 236, in extract_create_numpy
    self.extract_constraint_mats()
  File "/home/.../ScHiCEDRN/./ProcessData/PrepareData_tensorH.py", line 203, in extract_constraint_mats
    filepath = file_inter[0]
IndexError: list index out of range

Then I tried your example data: I downloaded "L6_all_brain.txt_1kb_contacts.mcool" from the Human dataset, put it in the Human directory, and ran the script again (with the cell number changed to 1). I got the same error.

Do you have any idea why this error occurs?

Thanks

yw7bh commented 9 months ago

It is a file path problem in the data processing: the script cannot find your file. I do not know whether you have looked into the details of the script; if you have not, other problems will follow this one.
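
`IndexError: list index out of range` at `filepath = file_inter[0]` is what Python raises when the list is empty, which fits the explanation that no input file was found at the expected path. A minimal pre-flight check can surface this earlier with a clearer message. This sketch assumes the inputs are `.mcool` files collected from a single directory; `find_mcool_files` and its arguments are hypothetical and are not part of ScHiCEDRN.

```python
from pathlib import Path


def find_mcool_files(data_dir, expected_count=None):
    """Return sorted .mcool paths in data_dir, failing early if none exist."""
    root = Path(data_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Data directory does not exist: {root}")

    # Non-recursive search: only files directly inside data_dir are matched.
    files = sorted(root.glob("*.mcool"))
    if not files:
        raise FileNotFoundError(f"No .mcool files found in {root}")

    # Optionally compare against the number of cells requested for training.
    if expected_count is not None and len(files) < expected_count:
        raise ValueError(
            f"Expected at least {expected_count} .mcool files in {root}, "
            f"found {len(files)}"
        )
    return files
```

Calling such a check before training starts turns an opaque index error deep in the processing code into a message that names the directory being searched.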

sdontsay commented 9 months ago

Thanks for the reply. I figured out what was wrong.

The current hurdle is that I have some all-zero files in the "Constraints" folder for certain chromosomes, so the program stops there. Can I just ignore those chromosomes? If not, do you have any suggestions for this situation?

Moreover, I noticed that in the Human dataset you provided, you merged all the cells of one cell type into one big .mcool file, while the scHi-C data I have at hand consists of separate individual cells. Does it matter for downstream analysis if I do not merge them?

Final question: I ran your script on one of the .mcool files from the Human data. It works well and produces several outputs (Constraints, Full_Mats, and Splits). I can see that "Constraints" is an intermediate output for training, but I am not sure about the others. Could you tell me which of them is the final enhanced scHi-C data?

yw7bh commented 9 months ago

Question 1: You can ignore anything you want; it depends on whether you fully understand each block of the scripts.
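
As an illustration of ignoring all-zero chromosomes, here is a minimal sketch. It assumes the per-chromosome matrices have already been loaded into a dictionary keyed by chromosome name, which is not necessarily how the repository stores them; `is_all_zero` and `drop_empty_chromosomes` are hypothetical helper names.

```python
def is_all_zero(matrix):
    """True if every entry of a 2D matrix (an iterable of rows) is zero.

    An empty matrix also counts as all-zero.
    """
    return not any(value != 0 for row in matrix for value in row)


def drop_empty_chromosomes(mats_by_chrom):
    """Split a {chromosome: matrix} mapping into kept matrices and skipped names."""
    kept, skipped = {}, []
    for chrom, matrix in mats_by_chrom.items():
        if is_all_zero(matrix):
            skipped.append(chrom)
        else:
            kept[chrom] = matrix
    return kept, skipped
```

Any later step that loops over a fixed chromosome list would need to skip the same names, so returning the skipped list makes that bookkeeping explicit.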

Question 2: Single-cell data in the ".cool" format can be converted to a multi-resolution file with the ".mcool" extension. I am sure I provided some single-cell datasets. Each single-cell ".mcool" file is an individual cell rather than all cells of one cell type.

Question 3: All of these (Constraints, Full_Mats, and Splits) are results of data processing, not training results. They are sequential files produced to prepare the training input (Splits is the input). You have many misunderstandings about this. The Splits directory contains all the data that should be divided into the training, validation, and test datasets. I think you may be new to deep learning methods. A lot of deep-learning knowledge can be learned online. Thanks.
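
To make the last step concrete, here is a minimal sketch of dividing a collection of prepared samples into training, validation, and test sets. It is a generic shuffled split under assumed fractions, not the repository's actual procedure, which may divide the data differently (for example by chromosome or by cell). `split_items` and its parameters are hypothetical, and the fractions are not claimed to correspond to the "-p" option.

```python
import random


def split_items(items, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle items reproducibly and cut them into train/validation/test lists."""
    if not 0 < train_frac < 1 or not 0 <= valid_frac < 1:
        raise ValueError("fractions must lie between 0 and 1")
    if train_frac + valid_frac >= 1:
        raise ValueError("fractions must leave a share for the test set")

    shuffled = list(items)
    # A seeded generator keeps the split identical across runs.
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains after the two cuts
    return train, valid, test
```

The model is trained on the first list, tuned on the second, and evaluated on the third; the enhanced matrices come from running the trained model, not from any of these prepared files.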

sdontsay commented 9 months ago

To be honest, I did sense your impatience and unfriendliness earlier in this conversation, but I have always believed that communication can prevent misunderstanding and open the door to cooperation. However, I was wrong.

If you do not want to answer questions that seem simple to you, you could make this repo inaccessible. If you think stupid questions do not match your advanced knowledge and good brain, you can choose not to put this project link in your paper and only answer questions from the paper's reviewers. If you really do not like answering basic questions, you could just ignore me rather than leave some stupid, ignorant words here.

Anyway, you are right that I am not an expert in deep learning; I am just interested in scHi-C data analysis. I guess it would be a good idea to talk to your advisor about how you treat others who are interested in your lab's work.

Hope we won't see each other in the future.

yw7bh commented 9 months ago

What you need to understand is that we are doing scientific research and posting our approach to show how we carried out our idea for solving a scientific problem. We are not a commercial company or an educational institution that has to teach you the basics of programming or adjust our code for a specific use (we are not selling software). It is your task to understand the code if you need to use it for your project. I am happy to have discussions with anyone interested in our scientific method and to find inspiration for new ideas that make progress in this area. Thanks.

BioinfoMachineLearning / ScHiCEDRN

Question regarding cell line for training #1
