dozmorovlab / TADCompare

Package for analysis and characterization of differential TADs
https://dozmorovlab.github.io/TADCompare/

Input file generation and data conversion #7

Closed bostanict closed 1 year ago

bostanict commented 3 years ago

Hi,

I have a basic question here. I like the tool a lot but am having trouble with input file generation.

I have data from HICUP, which I can convert to the homer, Juicer, hicpipe, gothic, and fithic formats. Is there any way to generate the input from any one of these formats?

I tried to generate the contact matrix using the HOMER tools (analyzeHiC), but because the chromosomes are really large, the process gets killed and fails. I would prefer not to break the chromosomes into chunks and process them one by one, so if there is an easier way, please let me know.

I also called TADs and loops using HOMER, and they are in BED format. Is there any way I can use them directly in the pipeline?

Thanks a lot in advance

mdozmorov commented 3 years ago

I'm not familiar with the HICUP output format, but if it is text, it should be possible to convert it to the sparse matrix format. If you provide small subsets of the data, I can try to write a script. You can also try the Juicer format; we provide instructions on how to get it in TADCompare: https://www.bioconductor.org/packages/release/bioc/vignettes/TADCompare/inst/doc/Input_Data.html#working-with-.hic-files

As for visualizing TADCompare results and external TAD data, there are many tools. Have a look at https://github.com/mdozmorov/HiC_tools#visualization; the choice depends on which programming environment you are most familiar with.

bostanict commented 3 years ago

Hi @mdozmorov, thanks for getting back to me so quickly. Here is an example of the HOMER-format output after HICUP:

1       chrX    145207315       -       chrUn_KI270742v1        13359   +
2       chrUn_KI270742v1        11867   -       chrX    133811276       -
3       chrUn_KI270742v1        91413   -       chrUn_KI270742v1        87814   +
4       chrUn_KI270742v1        15353   +       chrX    26786267        +
5       chrUn_KI270742v1        15880   +       chrX    145101514       +
6       chrUn_KI270742v1        139275  +       chrUn_KI270742v1        134337  -
7       chrX    56351278        +       chrUn_KI270742v1        14399   -
8       chrX    55213225        -       chrUn_KI270742v1        37417   -
9       chrX    99319901        -       chrUn_KI270742v1        161027  +
10      chrX    112058665       +       chrUn_KI270742v1        176261  +

I also have the peak calls from HOMER, which are in this format for TADs and loops:

TADS:

chr6    43064999        43148998        chr6    43064999        43148998        255,255,0       2.928   2.928
chr20   60206999        60368998        chr20   60206999        60368998        255,255,0       1.911   1.911
chr11   128027999       128864998       chr11   128027999       128864998       255,255,0       2.372   2.372
chr13   97022999        97364998        chr13   97022999        97364998        255,255,0       2.879   2.879
chr3    20192999        22130998        chr3    20192999        22130998        255,255,0       1.943   1.943
chr2    172859999       172937998       chr2    172859999       172937998       255,255,0       2.323   2.323
chr20   20657999        20831998        chr20   20657999        20831998        255,255,0       2.883   2.883
chr6    33506999        33569998        chr6    33506999        33569998        255,255,0       1.634   1.634
chr9    90896999        91739998        chr9    90896999        91739998        255,255,0       3.010   3.010
chr1    60842999        61046998        chr1    60842999        61046998        255,255,0       2.175   2.175

Loops:

chr2    16770000        16773000        chr2    17667000        17670000        0,0,250 33.742222       2.244069
chr11   10626000        10629000        chr11   10734000        10737000        0,0,250 112.746944      1.957650
chr2    71301000        71304000        chr2    71511000        71514000        0,0,250 76.773333       2.257595
chr12   104208000       104211000       chr12   104355000       104358000       0,0,250 102.067778      2.029318
chr3    7410000 7413000 chr3    8109000 8112000 0,0,250 26.343333       1.601173
chr7    108972000       108975000       chr7    109824000       109827000       0,0,250 24.723333       1.768769
chr2    186996000       186999000       chr2    187449000       187452000       0,0,250 30.106667       1.589152
chr2    61992000        61995000        chr2    62775000        62778000        0,0,250 18.401917       1.665698
chr8    125118000       125121000       chr8    125697000       125700000       0,0,250 82.941667       2.588617
chr9    1461000 1464000 chr9    1608000 1611000 0,0,250 192.627778      2.516918

I am not sure whether TADCompare also works on loops, but I am looking forward to your reply.

thanks

mdozmorov commented 3 years ago

Something is incomplete. The HOMER output after HICUP has paired genomic coordinates but no interaction frequencies. What is the HICUP output?

bostanict commented 3 years ago

The HICUP output is the read pairs in BAM format.

mdozmorov commented 3 years ago

Is there another file that accompanies the pairs of genomic coordinates? Interaction frequencies are the critical missing piece.

Definitely, BAM files are not Hi-C matrices. Again, I'm familiar with the Juicer, HiC-Pro, and HiCExplorer pipelines. I'm not sure what to do with HICUP BAM files to extract interaction matrices. You may explore the HiCExplorer and FAN-C pipelines for that, but I would use them in the first place instead of HICUP.
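
For readers with the same question: interaction frequencies can in principle be derived from paired coordinates by counting how many read pairs fall into each pair of genomic bins. The following is a minimal Python sketch of that idea, not a replacement for the pipelines named above. It assumes the seven-column layout of the example posted earlier in this thread (ID, chr1, pos1, strand1, chr2, pos2, strand2); the function name `bin_pairs` and the resolution used in the example are hypothetical choices, and no filtering or normalization is done.

```python
from collections import Counter

def bin_pairs(lines, chrom, resolution):
    """Count read pairs per bin pair on a single chromosome.

    Assumes whitespace-separated columns: id, chr1, pos1, strand1,
    chr2, pos2, strand2 (the layout of the example in this thread).
    Returns sorted (bin1_start, bin2_start, count) triples, i.e. a
    sparse upper-triangular contact matrix.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        _, chr1, pos1, _, chr2, pos2, _ = fields[:7]
        if chr1 != chrom or chr2 != chrom:
            continue  # keep only intra-chromosomal pairs on `chrom`
        # Round each position down to the start of its bin.
        b1 = int(pos1) // resolution * resolution
        b2 = int(pos2) // resolution * resolution
        lo, hi = sorted((b1, b2))
        counts[(lo, hi)] += 1
    return sorted((lo, hi, n) for (lo, hi), n in counts.items())

# Three rows taken from the example posted in this thread.
example = [
    "1 chrX 145207315 - chrUn_KI270742v1 13359 +",
    "3 chrUn_KI270742v1 91413 - chrUn_KI270742v1 87814 +",
    "6 chrUn_KI270742v1 139275 + chrUn_KI270742v1 134337 -",
]
for row in bin_pairs(example, "chrUn_KI270742v1", 50_000):
    print(*row, sep="\t")
```

Each printed row is one non-zero cell of the matrix. Real data would still need the filtering and normalization steps that dedicated Hi-C pipelines provide.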

bostanict commented 3 years ago

Thanks a lot. Since you directed me to the .hic files, I was able to generate those and can use them. If that is not successful, I will ping you again here. Thanks a lot.

bostanict commented 3 years ago

I was able to convert the .hic files to the input format and run TADCompare, thanks a lot.

Can I use TADCompare for differential analysis of loops as well? How can we set it up to detect loops and run the differential analysis on them?

Thanks

mdozmorov commented 3 years ago

TADCompare does not distinguish between TADs and loops. We call them "domains", and TADCompare compares domain boundaries, which include the boundaries of both TADs and loops.
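
To illustrate what "domain boundaries" means for externally called domains, here is a minimal Python sketch that collects boundary coordinates from rows like the HOMER TAD calls posted earlier in this thread. It assumes the first three whitespace-separated columns are chromosome, start, and end, which holds for the TAD rows shown (where columns 4 to 6 repeat columns 1 to 3) but not for the loop rows, whose two anchors sit in separate column triples. The function name `domain_boundaries` is hypothetical, and this is not TADCompare's own code.

```python
from collections import defaultdict

def domain_boundaries(bed_lines):
    """Return sorted, de-duplicated boundary coordinates per chromosome.

    Assumes columns 1-3 of each row are chrom, start, end of a domain.
    """
    boundaries = defaultdict(set)
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Both edges of a domain count as boundaries.
        boundaries[chrom].update((start, end))
    return {chrom: sorted(coords) for chrom, coords in boundaries.items()}

# Two chr6 TAD rows taken from the example posted in this thread.
tads = [
    "chr6 43064999 43148998 chr6 43064999 43148998 255,255,0 2.928 2.928",
    "chr6 33506999 33569998 chr6 33506999 33569998 255,255,0 1.634 1.634",
]
print(domain_boundaries(tads))
```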

bostanict commented 3 years ago

Since the input is the same, how do you then distinguish whether a call is a TAD or a loop? Is it based on the length and shape of the interactions in the interaction matrix?

mdozmorov commented 3 years ago

It's a general question: how do people distinguish TAD and loop boundaries? Length seems to be the most common criterion: smaller domains (<100kb) may be considered loops, and larger ones (>100kb & <2Mb) may be TADs.
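
The length heuristic above can be written down as a short Python sketch. The cutoffs come from this comment; the function name `classify_domain`, the "unclassified" label for domains of 2Mb or more, and the choice to place a domain of exactly 100kb in the TAD class are assumptions, because the comment leaves those cases open.

```python
def classify_domain(start, end, loop_max=100_000, tad_max=2_000_000):
    """Label a domain as 'loop' or 'TAD' by its length in base pairs.

    Cutoffs follow the rule of thumb in this thread: <100kb may be a
    loop, 100kb up to 2Mb may be a TAD. Anything else is left
    unclassified (an assumption; the thread does not cover that case).
    """
    length = end - start
    if length < loop_max:
        return "loop"
    if length < tad_max:
        return "TAD"
    return "unclassified"

# Two domains from the HOMER TAD calls posted in this thread.
print(classify_domain(43064999, 43148998))  # length 83,999 bp
print(classify_domain(20192999, 22130998))  # length 1,937,999 bp
```

The first domain was reported by HOMER as a TAD but falls below the 100kb cutoff, which shows that such thresholds are a convention rather than a strict definition.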

distilledchild commented 2 years ago

@mdozmorov @bostanict

Hi, I have the same issue: how can I convert the valid read pairs BAM file from the HICUP pipeline to a contact matrix? Could you please give me some guidance if you know?

Thanks.

mdozmorov commented 2 years ago

We don't currently use HICUP. If you have just BAM files, https://hicexplorer.readthedocs.io/en/latest/content/tools/hicBuildMatrix.html can process them into .cool matrix files, and then https://hicexplorer.readthedocs.io/en/latest/content/tools/hicConvertFormat.html can extract text matrices. But then, I'd process the data as HiCExplorer recommends https://hicexplorer.readthedocs.io/en/latest/content/example_usage.html#

distilledchild commented 2 years ago

@mdozmorov Thank you so much! I will let you know after the work!