NVIDIA-Genomics-Research / AtacWorks

Deep learning based processing of Atac-seq data
https://clara-parabricks.github.io/AtacWorks/

Trying to pip install on a cluster #195

Closed hugokitano closed 3 years ago

hugokitano commented 4 years ago

Hi, I'm trying to install atacworks on a cluster where I do not have root permissions. I have installed all the dependencies and tried to do pip install ., but I ran into this error

ERROR: Could not install packages due to an EnvironmentError: [('/oak/stanford/groups/satpathy/users/hkitano/AtacWorks/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.pack', '/tmp/pip-req-build-sdfdqve5/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.pack', "[Errno 13] Permission denied: '/tmp/pip-req-build-sdfdqve5/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.pack'"), ('/oak/stanford/groups/satpathy/users/hkitano/AtacWorks/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.idx', '/tmp/pip-req-build-sdfdqve5/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.idx', "[Errno 13] Permission denied: '/tmp/pip-req-build-sdfdqve5/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.idx'")]

I then tried to do "pip install --user ." but the error persisted. Any ideas? Thank you!

Hugo

ntadimeti commented 4 years ago

Hi @hugokitano ,

Could you share how you installed the dependencies? Could you also share whether you are running in a virtual env, conda, docker, or just a local installation?

hugokitano commented 4 years ago

I am running in a conda environment, and used my environment's version of pip to install the requirements and macs2. I also tried installing with pip3, with the same error.

ntadimeti commented 4 years ago

You are able to install the requirements fine with pip, but are having an issue only with the pip install . step?

hugokitano commented 4 years ago

Yes. The requirements step worked, but not the pip install . step. I'm pretty sure it must be an issue with me not having root access privileges on the cluster, since I was able to install AtacWorks fine using the same method locally. However, pip installing with --user fails as well.

ntadimeti commented 4 years ago

Hm, can you try this command: python -m pip install . ? I haven't encountered this before; let me see how I can help you.

hugokitano commented 4 years ago

Thanks for helping. I get the same kind of error:

ERROR: Could not install packages due to an EnvironmentError: [('/oak/stanford/groups/satpathy/users/hkitano/AtacWorks/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.pack', '/tmp/pip-req-build-nbe44k80/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.pack', "[Errno 13] Permission denied: '/tmp/pip-req-build-nbe44k80/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.pack'"), ('/oak/stanford/groups/satpathy/users/hkitano/AtacWorks/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.idx', '/tmp/pip-req-build-nbe44k80/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.idx', "[Errno 13] Permission denied: '/tmp/pip-req-build-nbe44k80/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.idx'")]

ntadimeti commented 4 years ago

Could you try upgrading setuptools and re-running the pip install? Try the following command: pip install --upgrade setuptools && pip install .

hugokitano commented 4 years ago

It successfully upgraded from setuptools-46.1.3 to setuptools-49.2.0, but I got the same error for the pip install .

ntadimeti commented 4 years ago

Another idea is to change your tmp directory. You can do that via the TMPDIR env variable. Change it to a location where you do have permission to write.

hugokitano commented 4 years ago

So I changed my TMPDIR variable to ~/tmp, but the error persists. The error keeps mentioning the /tmp directory, which is what TMPDIR was set to before I changed it. Yet the error message seems to reference directories that aren't in the /tmp directory (screenshot attached).

ntadimeti commented 4 years ago

Could you share the new TMPDIR path ?

hugokitano commented 4 years ago

I just created the ~/tmp directory, so it is empty. Should I try creating a new conda environment with this new TMPDIR?

ntadimeti commented 4 years ago

Please run export TMPDIR=/home/$USER/tmp to update the TMPDIR variable to the new directory. Then running pip install . might pick up the newly created tmp directory. Have you done this already ?

hugokitano commented 4 years ago

I got the following error, which is different than previous ones:

Processing /oak/stanford/groups/satpathy/users/hkitano/AtacWorks
ERROR: Could not install packages due to an EnvironmentError: [('/oak/stanford/groups/satpathy/users/hkitano/AtacWorks/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.pack', '/home/users/hkitano/tmp/pip-req-build-1mvbkli7/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.pack', "[Errno 13] Permission denied: '/home/users/hkitano/tmp/pip-req-build-1mvbkli7/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.pack'"), ('/oak/stanford/groups/satpathy/users/hkitano/AtacWorks/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.idx', '/home/users/hkitano/tmp/pip-req-build-1mvbkli7/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.idx', "[Errno 13] Permission denied: '/home/users/hkitano/tmp/pip-req-build-1mvbkli7/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.idx'")]

So it looks like my TMPDIR was changed, but the problem persists.

hugokitano commented 4 years ago

Is there a way I can make AtacWorks work without doing the pip install .? There should be, right, since all of the scripts are self-contained in this repository?

ntadimeti commented 4 years ago

Hugo,

Could you give the below commands a try ?

export TMPDIR=$HOME/tmp
mkdir -p $TMPDIR
pip install --user .

In the meantime, I will explore ways you can run without having to install the libraries. Can you run a Docker container on your cluster? If so, I can share the AtacWorks Docker container with you.

hugokitano commented 4 years ago

Unfortunately, still running into the same type of error.

Processing /oak/stanford/groups/satpathy/users/hkitano/AtacWorks
ERROR: Could not install packages due to an EnvironmentError: [('/oak/stanford/groups/satpathy/users/hkitano/AtacWorks/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.pack', '/home/users/hkitano/tmp/pip-req-build-vuyvppdk/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.pack', "[Errno 13] Permission denied: '/home/users/hkitano/tmp/pip-req-build-vuyvppdk/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.pack'"), ('/oak/stanford/groups/satpathy/users/hkitano/AtacWorks/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.idx', '/home/users/hkitano/tmp/pip-req-build-vuyvppdk/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.idx', "[Errno 13] Permission denied: '/home/users/hkitano/tmp/pip-req-build-vuyvppdk/.git/objects/pack/pack-047c57777251a6aedda1d732d0df6a7a7889d91d.idx'")]

From reading setup.py, it seems the pyclaragenomics folder is copied into a temporary directory created by pip. This must be where it fails. I've tried a number of different temporary directory paths and they all fail with this permissions error, despite the fact that I should have full permissions to these directories (I can write files to them, for example).

My cluster does not use Docker because it needs root access, though it does use Singularity.

ntadimeti commented 4 years ago

Could you do a chmod -R 777 /home/users/hkitano/tmp/ ?

hugokitano commented 4 years ago

yeah, same error

ntadimeti commented 4 years ago

Hugo,

It's a weird issue and I'm out of ideas for how to solve it. There are two options we could try as workarounds for this.

1) Move the files inside the scripts/ folder into the AtacWorks root dir: cd AtacWorks && mv scripts/* .

If you do this, you will have to run each script directly, for example by running python main.py, etc. Which branch have you cloned, master or dev-v0.3.0?

2) The second option is that I can create a local wheel package and upload it somewhere. You can download this wheel package and install AtacWorks from it. (Not sure if this will lead to similar or other permission errors.)

Let me know which option you find feasible. If you choose 1, we will have to adapt the tutorials, etc., to this change.

hugokitano commented 4 years ago

I'm in master, version 0.2.3. I think option 1 is fine. Running from main.py should be almost exactly the same; I'll just have to do scripts/main.py train --args instead of atacworks train --args, right?

ntadimeti commented 4 years ago

Right, you would have to move the files outside of scripts, so the module paths are accessible. So, you would do python main.py train instead of python scripts/main.py train. Hopefully that works. Let me know how it goes.

hugokitano commented 4 years ago

Ah, I see, I'll give it a try.

hugokitano commented 4 years ago

OK, I think it works. I did have to do a sys.path.append("") before I imported from claragenomics.io in the peak2bw.py file, which I expect I might have to do for other files going forward. But it does work!
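For anyone who hits the same thing, the tweak looks roughly like this at the top of the affected script (a rough sketch; the exact imports in peak2bw.py differ):

# Rough sketch of the workaround described above; placement and the example
# import are mine, not the script's real header.
import sys
sys.path.append("")  # "" resolves to the current working directory, i.e. the AtacWorks root

from claragenomics.io import bigwigio  # example import; peak2bw.py pulls in other modules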

hugokitano commented 4 years ago

I have an unrelated question: what does the bigwig layers option accomplish? For example, could you input a noisy peak bigwig as additional training data along with the noisy track bigwig?

avantikalal commented 4 years ago

Hi Hugo,

Yes, the layers option can be used to provide additional training data along with the noisy coverage track. In our preprint, we used this to include the positions of CTCF motifs for an experiment where we adapted the model to predict CTCF binding instead of chromatin accessibility.

For the standard use case of denoising and peak calling from ATAC data, we have tried using this option to supply a noisy peak track, and in our experiment it did not help.

hugokitano commented 4 years ago

And, sorry, one last question: if I were to call main.py eval on the model I trained in tutorial 1 (I'm on the master branch, 0.2.3), what would my command be? I know that the trained model is not good; I just want to see how it works. Thank you!

Hugo

ntadimeti commented 4 years ago

Hugo,

First you have to generate a holdout h5 file. You can use the bw2h5.py script to achieve that.

python $atacworks/scripts/bw2h5.py \
           --noisybw dsc.1.Mono.50.cutsites.smoothed.200.bw \
           --intervals holdout_intervals.bed \
           --out_dir ./ \
           --prefix Mono.50.2400.holdout \
           --pad 5000 \
           --nolabel

Now, you can do main.py eval -h to get the help options to run atacworks eval. In general, you would have to provide the following:

1) The model weights, using the --weights_path option. In the tutorial1 case, it will be model.best.pth.tar
2) The h5 file generated from the command above, using the --files option. In your case the file will be Mono.50.2400.holdout.h5
3) The chromosome sizes file, using the --sizes_file option. In your case the file is $atacworks/data/reference/hg19.auto.sizes

It will look something like below:

python main.py eval --weights_path <path-to-model-weights> --files <path-to-h5> --sizes_file $atacworks/data/reference/hg19.auto.sizes --config configs/infer_config.yaml

The infer_config.yaml can be used for both inference and evaluation as they share almost all parameters.

Hope this helps; let me know if you have any other questions.

hugokitano commented 4 years ago

One last question: what range should the scores of the bigwig be in? If I'm trying to train with a set of bigwigs that have different ranges of scores, should I somehow standardize them first? For example, I'm working with bigwigs that have ranges from 0 to about 0.03 (which I know won't train very well), 0 to about 5 (which should work well), and 0 to about 800 (I'm not sure whether this would work). Or would you recommend I only choose training files from the same experiment?

avantikalal commented 4 years ago

Hi @hugokitano, happy to help but I'm not sure I understand the experiment you're trying to perform. For our experiments, we have not tried to normalize the ranges of the bigwig files in any way. If you explain the experiment you're trying to do (why do your bigWig files have different ranges? What is the goal of the model you want to train? What do you plan to apply it to?) I can offer more suggestions.

hugokitano commented 4 years ago

Sorry for not being clear; I appreciate the help. I'm trying to train a model that predicts, from a basal chromatin state, a future state when the cell encounters a cytokine in its environment. For example, if I input a control ATAC-seq bigwig, I'd like to map that to a bigwig that predicts what that control would look like in contact with IL-4, interferon gamma, etc.

I thought AtacWorks would be a nice way to achieve some sort of baseline model. I'm trying to amass some control and IL-4 ATAC-seq bigwigs (these are from mouse macrophages) to train it, but the bigwigs have different signal-to-noise ratios, so the score values in the bigwigs have different ranges, as mentioned above.

One of the bigwigs I'm working with has score ranges from 0 to 0.02, with most values extremely close to 0, and when I trained AtacWorks on that, the model simply assigned almost every base pair a value of 0 (haha!). The bigwigs look good - I can see the peaks and everything - but the range is too small for the model to learn.

I would guess your team did not encounter this problem, and this might be something that should be done in pre-processing.

avantikalal commented 4 years ago

We haven't considered this kind of use case. But in general, all the bigwig files you use for training and testing should be processed the same way and should have similar ranges. I would suggest either using data from a single experiment, or else try to get the BAM files for the datasets you want to use and process them all into bigWig files using the same procedure.

Note that in our default infer_config file (https://github.com/clara-parabricks/AtacWorks/blob/dev-v0.3.0/configs/infer_config.yaml), we've set the default value of reg_rounding to 0. This parameter causes all regression outputs to be rounded to 0 decimal places (integer values). So a value like 0.02 would be written as 0. If you want decimal resolution in your output bigWig files, you can try modifying this parameter to 1, 2, or 3.

hugokitano commented 4 years ago

Great, thank you for the response! It's helpful! Why would the default classification rounding be 3 decimal places (10^-3), since it is a binary classification problem? In general, what was the idea behind using rounding at all?

avantikalal commented 4 years ago

The idea behind rounding was to reduce the file size. If we write full float values the output bedGraph files become very large and the extra precision is not really useful.

Thanks for pointing out the issue with the classification value! We had set it to 3 because in our original experiments we were writing the probability of each base belonging to a peak, and the probability values range from 0 to 1. But for the current default setting of writing binary peak calls this is not useful. @ntadimeti do you want to change this default value to 0?

hugokitano commented 4 years ago

Hi, I'm running into a weird problem when running evaluation on a trained model:

(atacworks) [hkitano@sh02-13n14 /oak/stanford/groups/satpathy/users/hkitano/AtacWorks]$ python main.py eval \
    --config configs/infer_config.yaml \
    --config_mparams configs/model_structure.yaml \
    --weights_path experiment_output/sc_atac_reg_latest/model_best.pth.tar \
    --files scatac_baseline/holdout.h5 \
    --sizes_file data/reference/mm10.auto.sizes \
    --intervals_file scatac_baseline/holdout_intervals.bed

Building model: resnet ...
Loading model weights from experiment_output/sc_atac_reg_latest/model_best.pth.tar...
Finished loading.
Finished building.
Eval for 41 batches
Inference -------------------- [ 0/41]
Evaluating on 50000 points.
Evaluation result: mse: 0.0000 | corrcoef: nan
Evaluation time taken: 187.080s
INFO:2020-08-17 15:48:36,123:AtacWorks-main] Waiting for writer to finish...
Writing the output to bigwig files
sort: cannot read: experiment_output/sc_atac_reg_inference_2020.08.17_15.45/holdout_inferred.track.bedGraph: No such file or directory
Process Process-2:
Traceback (most recent call last):
  File "/home/users/hkitano/miniconda3/envs/atacworks/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/users/hkitano/miniconda3/envs/atacworks/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "main.py", line 284, in writer
    deletebg=deletebg, sort=True)
  File "/oak/stanford/groups/satpathy/users/hkitano/AtacWorks/claragenomics/io/bigwigio.py", line 167, in bedgraph_to_bigwig
    env=sort_env)
  File "/home/users/hkitano/miniconda3/envs/atacworks/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sort', '-u', '-k1,1', '-k2,2n', 'experiment_output/sc_atac_reg_inference_2020.08.17_15.45/holdout_inferred.track.bedGraph', '-o', 'experiment_output/sc_atac_reg_inference_2020.08.17_15.45/holdout_inferred.track.bedGraph']' returned non-zero exit status 2.

The directory experiment_output/sc_atac_reg_inference_2020.08.17_15.45/ is empty, so it looks like the output bedGraph was never created. It's difficult for me to debug since the "writer" runs in a different process. Here's the config file; I'm using 0 workers. Thanks!

#
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# NVIDIA CORPORATION and its licensors retain all intellectual property
# and proprietary rights in and to this software, related documentation
# and any modifications thereto. Any use, reproduction, disclosure or
# distribution of this software and related documentation without an express
# license agreement from NVIDIA CORPORATION is strictly prohibited.
#

# Experiment args
out_home: 'experiment_output/'
label: 'sc_atac_reg_inference'
task: 'regression'
print_freq: 50
bs: 64
num_workers: 0
weights_path: "None"
gpu: 0
distributed: False
dist-url: 'tcp://127.0.0.1:4321'
dist-backend: 'gloo'
debug: False

# Data processing args
pad: 5000
transform: "None"
layers: "None"

# Infer args
files: "None"
intervals_file: "None"
sizes_file: "None"
reg_rounding: 5
cla_rounding: 5
infer_threshold: 0.5
batches_per_worker: 16

# Output file args
result_fname: "inferred"
gen_bigwig: True
deletebg: False

# Eval options
best_metric_choice: "CorrCoef"
threshold: 0.5

hugokitano commented 4 years ago

Oh, I think I figured out why this is happening: looking at the evaluation metrics printed out, it seems as if my predictions for the holdout data are all 0. Thus all scores are 0, and the bedGraph is never written in save_to_bedgraph in main.py. Does this seem correct?

ntadimeti commented 4 years ago

That seems correct. By default, the scores are thresholded at 0.5; all scores less than 0.5 are set to 0 and are not written. You can set infer_threshold: 0 and threshold: 0 in your configs. Then you will get the score outputs as probabilities, which will give you an idea of the model's predictions.
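Roughly what happens, as a toy sketch (this is not the actual AtacWorks source, just an illustration of the behavior described above):

import numpy as np

infer_threshold = 0.5
scores = np.array([0.10, 0.30, 0.45, 0.20])   # example model outputs
scores[scores < infer_threshold] = 0.0        # everything below the threshold is zeroed
print(np.count_nonzero(scores))               # 0 here, so no bedGraph rows are written
                                              # and the later sort step fails on a missing file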

ntadimeti commented 4 years ago

Also, can I ask whether you ran eval on your own data, or is this the tutorial data?

hugokitano commented 4 years ago

It is my own data, so I'm not expecting it to work necessarily. I can set these thresholds in train_config.yaml, right?

ntadimeti commented 4 years ago

It should be set in infer_config.yaml. That's the config file for both inference and evaluation.

hugokitano commented 4 years ago

Got it. How did you arrive at the weighting of the loss function that you use for the model? Why is mse_weight so small compared to the Pearson loss?

avantikalal commented 4 years ago

We used untransformed count data for our experiments. On such data, while Pearson correlation is limited to the range 0-1, MSE can be much higher. So it was necessary to down-weight the MSE loss so that both loss functions become somewhat comparable. If you are transforming or normalizing your count data in some way you may want to modify these weights.
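As a rough illustration of the idea (the weights below are only placeholders, not necessarily the defaults in our configs):

import torch
import torch.nn.functional as F

def pearson_loss(pred, target, eps=1e-8):
    # 1 - Pearson correlation, computed per example and averaged over the batch
    pred = pred - pred.mean(dim=1, keepdim=True)
    target = target - target.mean(dim=1, keepdim=True)
    corr = (pred * target).sum(dim=1) / (pred.norm(dim=1) * target.norm(dim=1) + eps)
    return (1.0 - corr).mean()

def regression_loss(pred, target, mse_weight=0.0005, pearson_weight=1.0):
    # MSE on untransformed counts can be much larger than the Pearson term,
    # so it is down-weighted to keep the two losses on a comparable scale.
    return mse_weight * F.mse_loss(pred, target) + pearson_weight * pearson_loss(pred, target)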

hugokitano commented 4 years ago

Just wanted to reiterate how thankful I am for your help. I've been weirded out by a few of my results so I decided to train on the tutorial1 data just to make sure everything is working as I think it should be. I'm on version 0.2.3. I made the h5 files using bw2h5.py as described, and I was able to look into them and make sure they looked good. For example, Mono.50.2400.train.h5 should have three keys ('input', 'label_cla', and 'label_reg') and ['input'][0].sum() = 402.0
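For example, a quick check along these lines (roughly what I did; file name, key names, and the expected sum are as above):

import h5py

with h5py.File("Mono.50.2400.train.h5", "r") as f:
    print(list(f.keys()))        # expect ['input', 'label_cla', 'label_reg']
    print(f['input'][0].sum())   # expect 402.0 for the tutorial1 training data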

However, during training, I printed out the sums of the three keys for each data point at around line 62 of train.py: print(x.sum(), y_reg.sum(), y_cla.sum())

I got a strange result: it looks as if the train_dataset is repeating the same training point for every index!

(atacworks) [hkitano@sh02-16n06 /oak/stanford/groups/satpathy/users/hkitano/AtacWorks]$ python main.py train \
>         --config tutorial1/configs/train_config.yaml \
>         --config_mparams tutorial1/configs/model_structure.yaml \
>         --files_train tutorial1/Mono.50.2400.train.h5 \
>         --val_files tutorial1/Mono.50.2400.val.h5
INFO:2020-08-20 13:16:58,844:AtacWorks-main] Running on GPU: 0
Building model: resnet ...
Finished building.
Saving config file to ./trained_models_2020.08.20_13.16/configs/model_structure.yaml...
> /oak/stanford/groups/satpathy/users/hkitano/AtacWorks/claragenomics/dl4atac/train.py(54)train()
-> model.train()
(Pdb) 
(Pdb) c
Num_batches 500; rank 0, gpu 0
tensor(25728.) tensor(3217280.) tensor(50048.)
Epoch [ 0/25] -------------------- [  0/500] mse:  20.142 | pearsonloss:   0.986 | total_loss:   1.603 | bce:   0.607
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
tensor(25728.) tensor(3217280.) tensor(50048.)
^C
Program interrupted. 

Then, I went into the debugger and looked at the dataset: it seems to be simply returning the same datapoint every time.

(screenshot of the debugger session attached)

I'm pretty sure I didn't change anything in the DatasetTrain class, and the DatasetInfer class seems to work fine. Has this been fixed in any subsequent update?
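To illustrate the symptom (a hypothetical example, not the actual AtacWorks code): if a Dataset's __getitem__ ignores the index it is given, every batch comes back identical, which matches the repeated sums above.

import torch
from torch.utils.data import Dataset, DataLoader

class BrokenDataset(Dataset):
    """Hypothetical illustration: __getitem__ ignores idx, so every item is sample 0."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[0]  # bug: should be self.data[idx]

data = torch.arange(10, dtype=torch.float32).reshape(10, 1)
loader = DataLoader(BrokenDataset(data), batch_size=2)
for batch in loader:
    print(batch.sum())  # prints tensor(0.) for every batch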

ntadimeti commented 4 years ago

Thanks for looking into it. I am able to reproduce what you are seeing. I have to investigate further to see why it is happening. We haven't intentionally fixed anything in the new release, but I will test and see if it still happens. I will create a new issue for this, so it's easy to track and fix.

hugokitano commented 4 years ago

Great, thank you!

ntadimeti commented 4 years ago

@hugokitano Please see pull request #208 for the fix. I am writing a unit test to prevent such errors in the future, and then we will merge this PR. In the meantime, you can make this change in the files you are working with and continue with your experiments.

I will ping you once the latest changes are merged. Thanks!

hugokitano commented 4 years ago

Thank you! So the only pertinent fix is the single line change in dataset.py, correct?

ntadimeti commented 4 years ago

That's right.

hugokitano commented 4 years ago

Hi,

What were your metrics on your final trained model? I was able to train a model with very high AUROC and specificity, though recall could be improved. Thank you!

ntadimeti commented 3 years ago

@avantikalal Do you recall the metrics for our models ?