GuanLab / Leopard


Issue when lifting over one-hot encoded DNA bigwigs from hg19 to hg38 #9

Closed: Al-Murphy closed this issue 3 years ago

Al-Murphy commented 3 years ago

Hi,

I am using Leopard to train on a large dataset of epigenomic data based on hg38. I therefore tried to lift over the one-hot encoded DNA bigwigs, the liver bigwig, and the average bigwig from hg19 to hg38. Unfortunately, the A, T and G bigwigs all failed when using CrossMap. Here is the error message for the T.bigwig file:

2021-07-26 11:03:45 [INFO]  Read the chain file "./data_download/liftover/hg19ToHg38.over.chain.gz"
2021-07-26 11:03:46 [INFO]  Liftover bigwig file ./data_download/dna/T_hg19.bigWig to bedGraph file ./data_download/dna/T.bgr:
2021-07-26 02:39:45 [INFO]  Merging overlapped entries in bedGraph file
2021-07-26 02:39:45 [INFO]  Sorting bedGraph file: ./data_download/dna/T.bgr
Traceback (most recent call last):
  File "/rds/general/user/aemurphy/home/anaconda3/envs/tfpred_data_dwnld_parallel/bin/CrossMap.py", line 145, in <module>
    crossmap_wig_file(mapTree, in_file, out_file, targetChromSizes, in_format = 'bigwig')
  File "/rds/general/user/aemurphy/home/anaconda3/envs/tfpred_data_dwnld_parallel/lib/python3.9/site-packages/cmmodule/mapwig.py", line 105, in crossmap_wig_file
    for (chrom, start, end, score) in bgrMerge.merge(out_prefix + '.bgr'):
  File "/rds/general/user/aemurphy/home/anaconda3/envs/tfpred_data_dwnld_parallel/lib/python3.9/site-packages/cmmodule/bgrMerge.py", line 80, in merge
    (last_chr, last_start, last_end, last_score) = lines[-1].split()
IndexError: list index out of range

I would usually expect this error when using a CrossMap chain file that doesn't match the genome build of the input, but the one-hot encoded DNA bigwigs are built on hg19, correct?

Oddly, C.bigwig did not fail. Is this something you have encountered? Is there an alternative way for me to create the A, T, G, C bigwig files for hg38? Alternatively, if you have versions of the A, T, C, G, avg and liver bigwigs based on hg38, that would be great.
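For context, the A/T/G/C tracks are just per-base binary indicators of the reference sequence. A minimal pure-Python sketch of how such one-hot tracks can be derived from a sequence (illustrative only; Leopard's actual generation code works on whole-genome FASTA and writes bigWig files):

```python
# One-hot encode a DNA sequence into four binary tracks (A, C, G, T).
# Illustrative sketch only -- not Leopard's actual bigwig-generation code.
def one_hot_tracks(seq):
    """Return a dict mapping each base to a 0/1 list over positions."""
    bases = "ACGT"
    tracks = {b: [0] * len(seq) for b in bases}
    for i, ch in enumerate(seq.upper()):
        if ch in tracks:  # N and other ambiguity codes stay 0 in every track
            tracks[ch][i] = 1
    return tracks

tracks = one_hot_tracks("ACGTN")
# tracks["A"] == [1, 0, 0, 0, 0]; position 4 (the N) is 0 in all four tracks
```

Lifting such tracks over should preserve the 0/1 values, which is why the asymmetric failure (C succeeding, A/T/G failing) is surprising.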

Thanks, Alan.

Hongyang449 commented 3 years ago

Hi Alan,

It is weird that C.bigwig worked but the others failed; I have never encountered this before. I checked the A, T, G, C bigwig files and the genome ranges are the same (hg19/GRCh37). Nevertheless, I now also provide the hg38/GRCh38 version, which you can download here: https://guanfiles.dcmb.med.umich.edu/Leopard/dna_grch38/ I generated these bigwig files directly from the hg38 FASTA instead of lifting them over. Let me know whether they work for you.

Thanks, Hongyang

Al-Murphy commented 3 years ago

Hi @Hongyang449,

Thank you for the hg38 DNA bigwigs, that is very helpful. I did note, however, that although converting your avg DNase bigwig to hg38 didn't produce an error, using it did produce an error when training the model. Is there any chance you have an hg38 version of the avg.bigwig? The hg19 one is at https://guanfiles.dcmb.med.umich.edu/Leopard/dnase_bigwig/avg.bigwig

Thanks, Alan.

Hongyang449 commented 3 years ago

Hi Alan,

I suggest you generate the average bigwig from your own data using this script: https://github.com/GuanLab/Leopard/blob/master/data/calculate_avg_bigwig.py You can run it like this: python calculate_avg_bigwig.py -i INPUT1.bigwig INPUT2.bigwig INPUT3.bigwig -o avg.bigwig -rg grch38
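Conceptually, the average track is the position-wise mean of the input signal tracks. A minimal sketch of that operation on in-memory lists (the actual script operates on whole bigWig files, chromosome by chromosome):

```python
# Position-wise mean of several equal-length signal tracks -- a toy
# stand-in for what calculate_avg_bigwig.py computes over bigWig files.
def average_tracks(tracks):
    """tracks: list of equal-length lists of floats; returns their position-wise mean."""
    n = len(tracks)
    return [sum(vals) / n for vals in zip(*tracks)]

avg = average_tracks([[0.0, 2.0, 4.0], [2.0, 2.0, 0.0]])
# avg == [1.0, 2.0, 2.0]
```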

Thanks, Hongyang

Al-Murphy commented 3 years ago

Hey Hongyang,

Thanks for the suggestion; I think that probably makes more sense for my application anyway! However, I am still getting the same error when I create this file:

2021-08-06 09:31:03.361496: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at summary_kernels.cc:242 : Invalid argument: Nan in summary histogram for: UNet/initial_conv_layer/conv1d/kernel_0
Traceback (most recent call last):
  File "./train.py", line 174, in <module>
    model.fit(dna_dataset_train,
  File "/rds/general/user/aemurphy/home/anaconda3/envs/tf2_leopard/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1145, in fit
  File "/rds/general/user/aemurphy/home/anaconda3/envs/tf2_leopard/lib/python3.8/site-packages/tensorflow/python/keras/callbacks.py", line 428, in on_epoch_end
  File "/rds/general/user/aemurphy/home/anaconda3/envs/tf2_leopard/lib/python3.8/site-packages/tensorflow/python/keras/callbacks.py", line 2339, in on_epoch_end
  File "/rds/general/user/aemurphy/home/anaconda3/envs/tf2_leopard/lib/python3.8/site-packages/tensorflow/python/keras/callbacks.py", line 2398, in _log_weights
  File "/rds/general/user/aemurphy/home/anaconda3/envs/tf2_leopard/lib/python3.8/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 930, in histogram
  File "/rds/general/user/aemurphy/home/anaconda3/envs/tf2_leopard/lib/python3.8/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 858, in summary_writer_function
  File "/rds/general/user/aemurphy/home/anaconda3/envs/tf2_leopard/lib/python3.8/site-packages/tensorflow/python/framework/smart_cond.py", line 54, in smart_cond
  File "/rds/general/user/aemurphy/home/anaconda3/envs/tf2_leopard/lib/python3.8/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 852, in record
  File "/rds/general/user/aemurphy/home/anaconda3/envs/tf2_leopard/lib/python3.8/site-packages/tensorflow/python/ops/summary_ops_v2.py", line 923, in function
  File "/rds/general/user/aemurphy/home/anaconda3/envs/tf2_leopard/lib/python3.8/site-packages/tensorflow/python/ops/gen_summary_ops.py", line 479, in write_histogram_summary
  File "/rds/general/user/aemurphy/home/anaconda3/envs/tf2_leopard/lib/python3.8/site-packages/tensorflow/python/ops/gen_summary_ops.py", line 498, in write_histogram_summary_eager_fallback
  File "/rds/general/user/aemurphy/home/anaconda3/envs/tf2_leopard/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: UNet/initial_conv_layer/conv1d/kernel_0 [Op:WriteHistogramSummary]

To note, this error doesn't appear when I use the hg19 avg.bigwig file available from your repo. I am creating the average file from the raw bigwigs rather than from the quantile normalised files (which use liver.bigwig). Should I be using the quantile normalised files for this step instead? Or do you have any idea what else could be causing the issue?

Thanks, Alan.

Hongyang449 commented 3 years ago

Hi Alan,

I think the error is related to missing/NaN values (e.g. in the tail region of a chromosome) in the bigwig files. I filled those NaN positions with zeros in the hg19 avg.bigwig, and the quantile normalization scripts automatically fill NaN with zeros. I've updated calculate_avg_bigwig.py to fix the NaN issue. Regarding the avg.bigwig calculation, it is better to use the quantile normalized files to calculate the average.
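The fix described above amounts to replacing NaN signal values with zeros before they reach the model; a single NaN in an input window propagates through the convolutions and produces the NaN histogram error seen in the traceback. A minimal sketch using only the standard library:

```python
import math

# Replace NaN signal values (e.g. from unmapped tail regions of a
# chromosome) with zeros, so no NaN propagates into training inputs.
def fill_nan(values, fill=0.0):
    return [fill if math.isnan(v) else v for v in values]

cleaned = fill_nan([1.5, float("nan"), 0.25])
# cleaned == [1.5, 0.0, 0.25]
```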

Thanks, Hongyang

Al-Murphy commented 3 years ago

Thanks for all your help, this worked!