CostaLab / reg-gen

Regulatory Genomics Toolbox: Python library and set of tools for the integrative analysis of high throughput regulatory genomics data.
https://reg-gen.readthedocs.io/
Other
103 stars 30 forks source link

rgt-hint footprinting warning and error #166

Closed yufeiZ24 closed 1 year ago

yufeiZ24 commented 4 years ago

Hi,

I have run HINT v0.13.0 on bulk ATAC-seq data (skipping chrY and chrM) using command "rgt-hint footprinting --atac-seq --paired-end --organism hg19 --output-prefix "$p"_fp "$p".sorted.bam "$p"_peaks.narrowPeak".

But Warnings like "The scikit HMM encountered errors when applied. in region (chr1,10014,10555). This iteration will be skipped." keep appearing among all chromosomes, and the output bed file is an empty file.

Thank you very much.

lzj1769 commented 4 years ago

Hi @yufeiZ24 ,

This warning message means there is something wrong when rgt-hint tries to segment the signal using HMM, which usually takes place in the regions near the start or end of the chromosome. Usually, it can be ignored.

May I ask how you obtained the peaks? by macs2? also, how is your data quality, for example, the number of reads and Frip? With these, maybe I can try to figure out why you end up with an empty file.

Thanks, Z. Li

yufeiZ24 commented 4 years ago

Hi @lzj1769 ,

we used macs2 for peak calling. We have checked the number of reads and frip, and lower coverage files have been succeeded before.

We used hs37d5 from 10X genomics as reference in genome mapping step. Would this cause the empty file trouble? And if it does, we are also confusing by the two required bed files for HINT genome. What's the differences between RefSeq and Gencode when we set up our own reference?

Thank you.

AndHerr commented 3 years ago

Hi @lzj1769 ,

I'm facing the same problem (The scikit HMM encountered errors when applied) using the command:

rgt-hint footprinting --atac-seq --paired-end --organism=mm10 --output-location=./ --output-prefix=NT_1_S1 markdup.bam narrow_PEAKS/NT_1_S1_merge.narrowPeak

This is the content of the .info file:

Number of reads: 18481020 Number of peaks: 63453 Number of tag counts within peaks: 6809093.0 Number of footprints: 0

The .bed is empty.

I mapped with bowtie2, using pre-built bowtie index. bowtie2 -p $NSLOTS -k 1 -X 1000 --dovetail --very-sensitive -x INDEX -1 $FILE_PATH/_1.fastq.gz -2 $FILE_PATH/_2.fastq.gz -S $OUT_PATH/$FILE_NAME.sam

Then, I filtered by quality and proper paired with samtools view (-q 30 -f 0x2) and I revomed dupplicates with samtools markdup (-r).

Finally, I run MACS2: macs2 callpeak -t markdup.bam -f BAM -g mm -n NT_1_S1.peaks --outdir $OUT_PATH/ --nomodel --nolambda --keep-dup auto --call-summits -q 0.05

And merge narrowPeakss with betools before to use:

rgt-hint footprinting --atac-seq --paired-end --organism=mm10 --output-location=./ --output-prefix=NT_1_S1 markdup.bam narrow_PEAKS/NT_1_S1_merge.narrowPeak

About data quality (FASTQC):

Red problems: Per base sequence content , Per base GC content (which I think is common for ATACseq) Yellow problems with Sequence Duplication Levels, Kmer Content

-- version HINT - Regulatory Analysis Toolbox (RGT) - v0.13.1

I hope this is enough information, but please let me know if you need further information. Thanks, Andrés

lzj1769 commented 3 years ago

Hi @AndHerr,

What is your python version? Also, is there any error message during rgt-hint running? can you check the statistics of the bam file by samtools flagstat in.bam and provide these information?

Thanks, Li

AndHerr commented 3 years ago

Hi @lzj1769 ,

I solved it, it was an issue about the genome. I used a mm10 genome I had instead of getting it by setupGenomicData.py.

Now, it worked nicely, sorry for the silly question. I just became aware that this could be the problem while waiting for your reply.

Thanks, Andrés

chickenflyingtosky commented 1 year ago

Hi @lzj1769, I'm facing the same problem with tair10. All of the peaks were skipped, and get no footprint. I get genome files using python setupGenomicData.py --tair10. And if i use the data from tutorial, it can run finely. Did the problem caused by my bam and peaks files?

lzj1769 commented 1 year ago

Hi @chickenflyingtosky ,

Is there any warning message generated?

Best, Zhijian

chickenflyingtosky commented 1 year ago

Hi @lzj1769, The warning message is like yufeiZ24's problem, but the The "scikit HMM encountered errors when applied. in region (chr1,10014,10555) " covers all of my peaks in bed file. And i have solved this problem. That's because the name of chromosome in the genome file created by setupGenomicData.py is like "chr1""chr2", but that in my bam file is "1""2". Finally i changed chromosome name to "1""2" and the program can run correctly. Thank you very much for your reply.

lzj1769 commented 1 year ago

Hi @chickenflyingtosky

Glad to see the problem was solved.

but that in my bam file is "1""2". Finally i changed chromosome name to "1""2" and the program can run correctly.

If I remember correctly, this usually happens when using an alignment tool like BWA.