NVIDIA-Genomics-Research / AtacWorks

Deep learning based processing of Atac-seq data
https://clara-parabricks.github.io/AtacWorks/

AtacWorks errors on custom data #221

Closed asundaresan1 closed 3 years ago

asundaresan1 commented 3 years ago

Hello, I am trying to analyze my bulk ATAC-seq samples but get the attached errors. These are ATAC-seq samples from human donor tissues, so I am trying to train the model, which is where I get the error. I am using AtacWorks version 0.3.0. I tested with the test data (available as part of the tutorial) and it worked fine without any errors.

The ATAC-seq reads were aligned to the human reference genome (hg19) using BWA. Duplicate reads were filtered out to keep only unique alignments. The resulting uniquely mapped reads were normalized to the same read depth across all samples and converted into bigWig files using BEDTools (genomecov -bga).

AtacWorks_train.err.txt ATACWorks_train.out.txt

ntadimeti commented 3 years ago

@asundaresan1 can you share your atacworks command please. I'm taking a look at your logs, will report back soon.

asundaresan1 commented 3 years ago

atacworks train \
    --noisybw /path_to_noisybw \
    --cleanbw /path_to_cleanbw \
    --cleanpeakfile /path_to_cleanpeakfile \
    --genome /path_to_hg19.auto.sizes \
    --val_chrom chr2 \
    --holdout_chrom chr10 \
    --out_home "path_to_output_train" \
    --exp_name "atacworks_train" \
    --distributed

ntadimeti commented 3 years ago

It seems like this error is occurring because y_true contains only one class, going by this message in the log: Only one class present in y_true. It is possible that your data is imbalanced and all the peaks are being assigned to a single class. Can you verify this on your end?

asundaresan1 commented 3 years ago

How should I verify this? I also tested this on mouse cell line ATAC-seq data and got the same error.

ntadimeti commented 3 years ago

The atacworks output directory should contain <prefix>_train.h5 and <prefix>_val.h5 files. You can try loading them and looking at the number of 0s and 1s in the last channel of the numpy arrays in each h5 file.

If the files are small enough, please upload them so I can try it as well.

Could you also share how you generated the peak file you passed in with the --cleanpeakfile flag?
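For reference, checking the 0/1 balance could look roughly like the sketch below. The dataset key 'label_cla' is taken from a later comment in this thread, and a tiny synthetic file stands in for the real <prefix>_train.h5, so treat the names as assumptions rather than the exact AtacWorks file layout:

```python
import os
import tempfile

import h5py
import numpy as np

# Write a tiny stand-in file (the real one is produced by atacworks bw2h5).
path = os.path.join(tempfile.mkdtemp(), "train.h5")
with h5py.File(path, "w") as f:
    # 'label_cla' is assumed from a later comment in this thread.
    f.create_dataset("label_cla", data=np.array([0, 0, 0, 1, 0, 1], dtype=np.int8))

# Load it back and count the classes; both 0 and 1 should appear.
with h5py.File(path, "r") as f:
    values, counts = np.unique(f["label_cla"][:], return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```

If only a single value shows up in the counts, the labels are all one class and training will fail in the way described above.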

asundaresan1 commented 3 years ago

The files are too big to be uploaded here. The peaks were called with macs2 callpeak using the options -f BAMPE and --keep-dup all.

avantikalal commented 3 years ago

Hi @asundaresan1, wondering if you were able to resolve this issue? If not, you can go to the bw2h5 folder produced by AtacWorks and look at the train.h5 and val.h5 files in the following way:

import h5py
import numpy as np
f = h5py.File('val.h5', 'r')  # use the appropriate file name here
np.unique(f['label_cla'], return_counts=True)

Please let us know the output of this command for your train.h5 and val.h5 files.

asundaresan1 commented 3 years ago

Hi @avantikalal and @ntadimeti, I was able to train the model only after I preprocessed my samples using https://github.com/zchiang/atacworks_analysis/tree/master/preprocessing.

However, after denoising I don't see much improvement in the quality of my data. I have attached the TSS profiles of all the samples before and after denoising. I can see that the scale has improved and the profiles are smoother, but I was expecting Sample_3's profile to look like Sample_4's, for example. Is there anything else I can try? TSS_profile_denoised.pdf TSS_profile_prior_to_denoising.pdf

These are denoised using https://ngc.nvidia.com/catalog/models/nvidia:atac_bulk_lowcov_1m_50m
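For context, a TSS profile like the ones attached is essentially the average per-base coverage in fixed windows centered on transcription start sites. A minimal numpy sketch with toy placeholder data (the `tss_profile` helper is hypothetical, not part of AtacWorks):

```python
import numpy as np

def tss_profile(coverage, tss_positions, flank=2000):
    """Average coverage in +/- flank windows around each TSS (hypothetical helper)."""
    windows = []
    for tss in tss_positions:
        if tss - flank >= 0 and tss + flank <= len(coverage):
            windows.append(coverage[tss - flank : tss + flank])
    return np.mean(windows, axis=0)

# Toy data: a single enriched region around position 5000 on a fake chromosome.
cov = np.zeros(10000)
cov[4900:5100] = 1.0
profile = tss_profile(cov, [5000], flank=200)
```

Denoising that smooths signal locally can leave this genome-wide aggregate largely unchanged, which is consistent with the explanation in the next comment.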

avantikalal commented 3 years ago

Hi @asundaresan1, the model you've used is trained to make low-coverage data look like higher coverage data, so it will enhance the profile at a local scale, but it does not improve aggregate TSS score across the whole genome. We do have one model which is trained to improve TSS score: https://ngc.nvidia.com/catalog/models/nvidia:atac_bulk_lowqual_20m_20m . You could try this and see if it meets your purpose. What is the aim of your experiment, and what is the sequencing depth of your data?

asundaresan1 commented 3 years ago

Thank you for your suggestion. Let me try with that model. Is there a way to get a denoised BAM along with the bigWigs and peaks? The sequencing depth for all the samples is between 42M and 143M reads. We have performed ATAC-seq on human donor tissue under two different conditions and want to compare them. The profiles I sent are from one condition; the profiles from the other condition look even worse. There is a lot of noise and the peak calls are not good. If we can get denoised data, we would like to use it for comparative analysis across these conditions (comparing open chromatin regions, TF footprinting, etc.).

avantikalal commented 3 years ago

Thanks for sharing the details @asundaresan1. Based on this, I think the model I suggested above is the best suited for your data. If that too doesn't work, we can discuss whether it may be possible for you to train your own model. Unfortunately AtacWorks cannot produce a BAM file - only bigwig/bedGraph and peak calls.

avantikalal commented 3 years ago

Hi @asundaresan1 , just checking in on this issue. Were you able to get better results using the second model?

asundaresan1 commented 3 years ago

Hi @avantikalal , yes I got better results with the model you suggested.

umasstr commented 3 years ago

Hi @avantikalal, in cases where noisy samples have coverage greater than or equal to that of the clean samples, should users always forgo training and use your pretrained model, nvidia:atac_bulk_lowqual_20m_20m?

For example: (coverage comparison image attached)

If the pretrained model is not successful in reducing noise, are there training parameters that should be considered when constructing a custom model?

Thanks for the help—atacworks looks like a gamechanger!

avantikalal commented 3 years ago

Hi @umasstr, I've opened a new issue for this discussion: https://github.com/clara-parabricks/AtacWorks/issues/236