Ruitulyu / KAS-Analyzer

New computational framework to process and analyze KAS-seq and spKAS-seq data.
MIT License

t2t compatibility #6

Closed ecroot closed 11 months ago

ecroot commented 1 year ago

Is your feature request related to a problem? Please describe.

I am trying to run KAS-Analyzer with bwa mapping to the t2t (hs1) human genome assembly. The command I am using is:

KAS-Analyzer KAS-seq -a bwa -t 10 -i ~/Data/ref_genomes/t2t/hs1.fa -e 150 -o NO2test -s hs1 -1 NO2_S11_R1_001_trimmed.fq.gz

After running successfully through the following stages:

  • bwa mapping to my indexed t2t reference genome
  • samtools sort
  • samtools rmdup

the command gets stuck at the stage "Transfer test_rmdup.bam into test.bed with bamToBed." with the error:

Error: Unable to open file /path/Programs/KAS-Analyzer/scripts/../blacklist/hs1-blacklist.bed. Exiting.

I have checked the blacklist directory, and there does not appear to be a t2t-related blacklist file available.

It is frustrating to get this far before the error occurs.

Describe the solution you'd like

  1. t2t compatibility, by either:

    • incorporation of a relevant blacklist or other exclusion list (I'm not sure that there is an ENCODE blacklist for t2t yet, but there are other, similar efforts such as https://academic.oup.com/bioinformatics/article/39/4/btad198/7126418), and relevant methods to handle it
    • skipping the blacklist step for t2t - this may be appropriate given that it is more complete than other builds
  2. earlier checks for a valid blacklist argument in the command, so that if an invalid genome version is requested, then the command fails immediately, with a relevant warning message

  3. documentation to clarify that t2t alignment is not (yet) supported

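
The early check requested in item 2 could look roughly like the sketch below. This is a hypothetical helper, not part of KAS-Analyzer: the function name `check_assembly_files` is invented for illustration, and the two file name patterns are inferred only from the error messages reported in this thread, so other support files may also be needed.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not part of KAS-Analyzer.
# Usage: check_assembly_files <KAS-Analyzer install dir> <assembly id>
# The file name patterns below are inferred from the error messages
# quoted in this thread and may not be the complete set.
check_assembly_files() {
    kas_dir=$1
    assembly=$2
    status=0
    for f in \
        "$kas_dir/blacklist/${assembly}-blacklist.bed" \
        "$kas_dir/chrom_size/${assembly}.chrom.sizes.bed"
    do
        # An empty file still counts as present; only unreadable or
        # missing files are reported.
        if [ ! -r "$f" ]; then
            echo "Missing support file for '$assembly': $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

Running something like `check_assembly_files /path/Programs/KAS-Analyzer hs1 || exit 1` before starting the mapping step would surface a missing file in seconds rather than after alignment has finished.
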
Ruitulyu commented 1 year ago

Hi, The error occurs because the current KAS-Analyzer does not support the t2t reference genome (no blacklist file is available for it). As a workaround, you can create a blank blacklist file and put it in the blacklist directory. Best, Ruitu
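
A minimal sketch of that workaround, assuming the file name from the error message above (`<assembly>-blacklist.bed` inside the install's `blacklist` directory). The helper name `make_blank_blacklist` is hypothetical and the install path is whatever applies on your system.

```shell
#!/bin/sh
# Hypothetical helper; not part of KAS-Analyzer.
# Usage: make_blank_blacklist <KAS-Analyzer install dir> <assembly id>
# Creates an empty <assembly>-blacklist.bed and prints its path.
make_blank_blacklist() {
    target="$1/blacklist/${2}-blacklist.bed"
    # touch creates the file if it is absent and leaves the contents
    # of an existing file untouched.
    touch "$target" || return 1
    printf '%s\n' "$target"
}
```

For example, `make_blank_blacklist /path/Programs/KAS-Analyzer hs1`. Note that an empty blacklist means no regions are excluded, so any artifact-prone regions of the assembly remain in the downstream results.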

ecroot commented 1 year ago

Hi Ruitu, thank you for your suggestion. Adding a blank hs1-blacklist.bed file to the KAS-Analyzer/blacklist directory worked to get around the blacklist error.

However, I then immediately ran into a related error at the next stage: Error: Unable to open file path_to/KAS-Analyzer/scripts/../chrom_size/hs1.chrom.sizes.bed. Exiting.

For anyone else experiencing this error, here's how I resolved it:

The files I created are fairly basic compared to the files for other genome builds (the files for the other genome builds contain information for various release updates, whereas for t2t the file I downloaded only has information for version). I have attached the files I used, in case they are of use to other KAS-Analyzer users who have experienced similar issues.

hs1.chrom.sizes.zip
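
The exact steps and file layout used above are not shown in this thread. As one possible starting point, the sketch below converts a two-column chromosome sizes table (name, length) into three-column BED intervals (name, 0, length). Whether this matches what KAS-Analyzer expects in `hs1.chrom.sizes.bed` is an assumption; compare the result with an existing file in the `chrom_size` directory before relying on it. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper; the target layout is an assumption, not
# something confirmed in this thread.
# Usage: chrom_sizes_to_bed <two-column chrom.sizes file>
# Prints one tab-separated BED interval per chromosome: name, 0, length.
chrom_sizes_to_bed() {
    awk 'BEGIN { OFS = "\t" } NF >= 2 { print $1, 0, $2 }' "$1"
}
```

Usage would be along the lines of `chrom_sizes_to_bed hs1.chrom.sizes > hs1.chrom.sizes.bed`, with the output then copied into the `chrom_size` directory.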

ecroot commented 1 year ago

Hi Ruitu,

I have another question/request regarding t2t compatibility.

The peakscalling command is also not compatible with t2t. The peak callers macs2 and epic2 seem to only require a genome to be specified for size estimation. Therefore, when using the t2t reference (i.e. the .bed input files have been generated for t2t), is it acceptable to use hg38 as an input for peakscalling, because it is close in size to t2t? Or will there be negative consequences for providing an inaccurate genome build version here?

Do you have any tips for how best to handle peakscalling for t2t?

Thank you for your help so far, Emmon

Ruitulyu commented 12 months ago

Hi, I don't think using hg38 as the input for peakscalling will hurt accuracy. The genome assembly you select only tells KAS-Analyzer which effective genome size to pass to MACS2 or epic2, and as far as I understand that value is just a rough estimate. Best, Ruitu
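
To judge how close two assemblies are in size, one option is to sum the length column of each assembly's two-column chromosome sizes table, as sketched below. The function name is hypothetical. The total assembly length is only an upper bound on the effective genome size, which excludes regions where reads cannot be placed uniquely, so treat the comparison as a rough sanity check.

```shell
#!/bin/sh
# Hypothetical helper for comparing assemblies; not part of KAS-Analyzer.
# Usage: total_assembly_length <two-column chrom.sizes file>
# Prints the sum of the length column.
total_assembly_length() {
    # %.0f avoids the 32-bit integer cap that some awk builds apply
    # to %d, which matters for multi-gigabase genomes.
    awk '{ total += $2 } END { printf "%.0f\n", total }' "$1"
}
```

Calling this on the hg38 and hs1 size tables and comparing the two totals gives a quick sense of how far apart they are.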

ecroot commented 11 months ago

Thanks for your help. These KAS-seq and peakscalling issues are resolved now, so I will close this thread.