XiaoTaoWang / HiCLift

A fast and efficient tool for converting chromatin interaction data between genome assemblies
Other
63 stars 4 forks source link

Only used for cool format to pairs format transformation #1

Closed wbszhu closed 1 year ago

wbszhu commented 2 years ago

Hi, I have implemented the transformation from cool format to pairs format using the following command.

for i in LW
do 
for res in 5000 10000 40000
do
pairLiftOver --input ${i}.balance.mcool::/resolutions/${res}  --input-format cooler  --out-pre ${i}_${res} --output-format pairs --out-chromsizes ~/22.genome/susScr11_xy/pig11.1_from_star_chrom.sizes --chain-file  pig_chain.txt --in-assembly ~/22.genome/susScr11_xy/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa --out-assembly ~/22.genome/susScr11_xy/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa --logFile pairLiftOver.log
done
done

with a fake --chain-file pig_chain.txt ,like this.

head pig_chain.txt 
chain   274330532   chr1    274330532   +   0   274330532   chr1    274330532   +   0   274330532   1
274330532
chain   151935994   chr2    151935994   +   0   151935994   chr2    151935994   +   0   151935994   2
151935994
chain   132848913   chr3    132848913   +   0   132848913   chr3    132848913   +   0   132848913   3
132848913
chain   130910915   chr4    130910915   +   0   130910915   chr4    130910915   +   0   130910915   4
130910915
chain   104526007   chr5    104526007   +   0   104526007   chr5    104526007   +   0   104526007   5
104526007
...

and the pairLiftOver.log

root                      INFO    @ 02/22/22 19:15:28: 
# ARGUMENT LIST:
# Input path = LW.balance.mcool::/resolutions/40000
# Input format = cooler
# Output prefix = LW
# Output format = pairs
# Chromosome Sizes of the output assembly = /public/home/luzhang/22.genome/susScr11_xy/pig11.1_from_star_chrom.sizes
# Generate contact maps at 11 resolutions = False
# Input assembly = /public/home/luzhang/22.genome/susScr11_xy/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa
# Output assembly = /public/home/luzhang/22.genome/susScr11_xy/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa
# Chain file = pig_chain.txt
# Temporary Dir = .pairliftover
# Allocated memory = 8G
# Number of Processes = 8
# Log file name = pairLiftOver.log
pairLiftOver.utilities    INFO    @ 02/22/22 19:15:40: Writing headers ...
pairLiftOver.utilities    INFO    @ 02/22/22 19:15:40: Converting, sorting, and compressing ...
pairLiftOver.utilities    INFO    @ 02/22/22 19:48:17: 124,244,699 / 124,268,605 pairs were uniquely mapped to /public/home/luzhang/22.genome/susScr
11_xy/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa
pairLiftOver.utilities    INFO    @ 02/22/22 19:48:17: Done
root                      INFO    @ 02/22/22 19:55:54: 

the result like this

## pairs format v1.0.0
#shape: upper triangle
#genome_assembly: /public/home/luzhang/22.genome/susScr11_xy/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa
#chromsize: chr1 274330532
#chromsize: chr2 151935994
#chromsize: chr3 132848913
#chromsize: chr4 130910915
#chromsize: chr5 104526007
#chromsize: chr6 170843587
#chromsize: chr7 121844099
#chromsize: chr8 138966237
#chromsize: chr9 139512083
#chromsize: chr10 69359453
#chromsize: chr11 79169978
#chromsize: chr12 61602749
#chromsize: chr13 208334590
#chromsize: chr14 141755446
#chromsize: chr15 140412725
#chromsize: chr16 79944280
#chromsize: chr17 63494081
#chromsize: chr18 55982971
#chromsize: chrX 125939595
#chromsize: chrY 43547828
#chromsize: chrMT 16613
#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2
#pairLiftOver: coordinates transformed from /public/home/luzhang/22.genome/susScr11_xy/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa
.       chr1    3       chr1    104689810       .       .
.       chr1    8       chr1    524475  .       .
.       chr1    14      chr1    3198    .       .
.       chr1    34      chr1    64317   .       .
.       chr1    59      chr1    16490   .       .
.       chr1    90      chr1    6702    .       .
.       chr1    92      chr1    23338   .       .
.       chr1    106     chr1    3909613 .       .
.       chr1    125     chr1    8836    .       .
.       chr1    151     chr1    12468   .       .
...

Am I right? would you consider adding a simple function to make it work?

Nanguage commented 2 years ago

I think this tool will also be useful for format conversion. Maybe add a parameter(--conversion-only) to represent pure format conversion instead of liftover.

XiaoTaoWang commented 2 years ago

@Nanguage @wbszhu this is a good idea. I will add this function and keep you updated!

XiaoTaoWang commented 2 years ago

Hi all, I've added this function to v0.1.3. To perform a pure format conversion without liftover, just make sure you specify --in-assembly and --out-assembly to the same assembly name. Here is an example command to transform a contact matrix from the .cool format to the .hic format::

$ pairLiftOver --input Rao2014-K562-MboI-allreps-filtered.5kb.cool --input-format cooler \
--out-pre K562-format-conversion-test --output-format hic --out-chromsizes hg19.chrom.sizes \
--in-assembly hg19 --out-assembly hg19 --memory 40G

Let me know your feedback!

wbszhu commented 2 years ago

The feedback is coming~ using the following command.

for i in LW-2 
do 
for res in 40000
do
pairLiftOver --input ../${i}.balance.mcool::/resolutions/${res}  --input-format cooler  --out-pre ${i}_${res} --output-format pairs --out-chromsizes ~/22.genome/susScr11_xy/pig11.1_from_star_chrom.sizes --in-assembly ~/22.genome/susScr11_xy/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa --out-assembly ~/22.genome/susScr11_xy/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa --logFile pairLiftOver.log
done
done

pairLiftOver.log

root                      INFO    @ 03/01/22 11:11:07: 
# ARGUMENT LIST:
# Input path = ../LW-2.balance.mcool::/resolutions/40000
# Input format = cooler
# Output prefix = LW-2_40000
# Output format = pairs
# Chromosome Sizes of the output assembly = /public/home/luzhang/22.genome/susScr11_xy/pig11.1_from_star_chrom.sizes
# Generate contact maps at 11 resolutions = False
# Input assembly = /public/home/luzhang/22.genome/susScr11_xy/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa
# Output assembly = /public/home/luzhang/22.genome/susScr11_xy/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa
# Chain file = None
# Temporary Dir = .pairliftover
# Allocated memory = 8G
# Number of Processes = 8
# Log file name = pairLiftOver.log
root                      INFO    @ 03/01/22 11:11:27: Trying to perform a pure format conversion without liftover ...
pairLiftOver.utilities    INFO    @ 03/01/22 11:11:27: Writing headers ...
pairLiftOver.utilities    INFO    @ 03/01/22 11:11:27: Dumping contact pairs from ../LW-2.balance.mcool::/resolutions/40000 ...
pairLiftOver.utilities    INFO    @ 03/01/22 11:16:17: Done

zcat LW-2_40000.pairs.gz |head

## pairs format v1.0.0
#shape: upper triangle
#genome_assembly: /public/home/luzhang/22.genome/susScr11_xy/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa
#chromsize: chr1 274330532
#chromsize: chr2 151935994
#chromsize: chr3 132848913
#chromsize: chr4 130910915
#chromsize: chr5 104526007
#chromsize: chr6 170843587
#chromsize: chr7 121844099

zcat LW-2_40000.pairs.gz |grep -v "#"|head

.   chr1    20000   chr1    20000   .   .
.   chr1    20000   chr1    20000   .   .
.   chr1    20000   chr1    20000   .   .
.   chr1    20000   chr1    20000   .   .
.   chr1    20000   chr1    20000   .   .
.   chr1    20000   chr1    20000   .   .
.   chr1    20000   chr1    20000   .   .
.   chr1    20000   chr1    20000   .   .
.   chr1    20000   chr1    20000   .   .
.   chr1    20000   chr1    20000   .   .
...

zcat LW-2_40000.pairs.gz |grep -v "#"|tail

.   chrY    43500000    chrY    43500000    .   .
.   chrY    43500000    chrY    43500000    .   .
.   chrY    43500000    chrY    43500000    .   .
.   chrY    43500000    chrY    43500000    .   .
.   chrY    43500000    chrY    43500000    .   .
.   chrY    43500000    chrY    43533914    .   .
.   chrY    43500000    chrY    43533914    .   .
.   chrY    43533914    chrY    43533914    .   .
.   chrY    43533914    chrY    43533914    .   .
.   chrY    43533914    chrY    43533914    .   .
...

as i do in the last comment.

zcat ../pairs/LW-2_40000.pairs.gz |grep -v "#"|wc -l
63809963

and i do in this comment.

zcat LW-2_40000.pairs.gz |grep -v "#"|wc -l
63820471

And i check the contents of two files, they are totally different.

XiaoTaoWang commented 2 years ago

I think this is reasonable, I cannot guarantee the accuracy of your previous results because they were based on a fake chain file and an unnecessary liftover (which might induce errors). And this new version doesn't do any liftover at all, and it simply extracts contact pairs from your cool file.

wbszhu commented 2 years ago

Thanks, I think you are right. LuZhang