Boyle-Lab / SEMpl

C++ implementation of the SEM algorithm
.pwm files #19

Closed annahutch closed 3 years ago

annahutch commented 3 years ago

Hi,

I would like some guidance on where to find the .pwm files that the software requires. The manuscript states: "We obtained ChIP-seq and DNase-seq data from the ENCODE project and PWMs from the JASPAR, Transfac, UniPROBE and Jolma databases" but I can't find any .pwm files on these websites that look similar to the one used in the software example (MA0114.1.pwm). UniPROBE does have an option to download a .pwm (http://thebrain.bwh.harvard.edu/uniprobe/details2.php?id=66), but using this in SEMpl gives the following error:

*** Error in `./iterativeSEM': double free or corruption (out): 0x000000000178c870 ***

Thanks in advance, Anna

aboyle commented 3 years ago

I believe this is the TRANSFAC format for PWMs. It is just a text file, and the extension isn't important to the program. Here is the example PWM that is provided:

DE HNF4A MA0114.1
0 28 7 27 5 X
1 2 2 56 7 X
2 12 4 35 16 X
3 5 23 20 19 X
4 3 51 4 9 X
5 59 1 3 4 X
6 53 2 10 2 X
7 56 1 8 2 X
8 4 4 58 1 X
9 6 2 33 26 X
10 3 22 11 31 X
11 4 49 5 9 X
12 42 7 10 8 X
XX
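To compare another motif file against this example, a quick check along these lines may help. This is only a sketch: it assumes the example is laid out with one record per line (a DE header, one row per position with a position index, four counts and a placeholder letter, then a closing XX), and the rules it enforces are inferred from the example rather than taken from SEMpl's parser. The function name `check_pwm_text` is hypothetical.

```python
def check_pwm_text(text):
    """Return a list of ways `text` differs from the example PWM layout.

    The checks are inferred from the MA0114.1 example only; they are not
    taken from SEMpl's own parsing code.
    """
    problems = []
    lines = [ln for ln in text.splitlines() if ln.strip()]

    if not lines or not lines[0].startswith("DE"):
        problems.append("first line should start with 'DE'")
    if not lines or lines[-1].strip() != "XX":
        problems.append("last line should be 'XX'")

    body = lines[1:-1]  # everything between the DE header and the final XX
    if not body:
        problems.append("no matrix rows found")

    for row_number, line in enumerate(body):
        fields = line.split()
        if len(fields) != 6:
            problems.append(
                f"row {row_number}: expected 6 fields, got {len(fields)}"
            )
            continue
        # fields: position, four counts, placeholder letter
        for value in fields[1:5]:
            if not value.isdigit():
                problems.append(
                    f"row {row_number}: count {value!r} is not a whole number"
                )
    return problems
```

An empty list means the text matches the layout of the example; it does not guarantee that SEMpl will accept the file.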

You should be able to download these from JASPAR. For example the transfac link in: http://jaspar.genereg.net/matrix/MA0006.1/

I also saw your other issue comment that seemed to show that you had not gotten the index working yet but I can't see it on github. Please reopen that if you are still having trouble and I'll comment on it.

annahutch commented 3 years ago

Hi Alan,

Thanks for your response. I believe that some post-processing must have been done on the files. For example downloading the .transfac file for MA0114.1 as in the example (http://jaspar.genereg.net/api/v1/matrix/MA0114.1.transfac) and then running SEMpl:

./iterativeSEM -PWM examples/MA0114.1.transfac -merge_file examples/wgEncodeOpenChromDnaseHepg2Pk.narrowPeak -big_wig examples/wgEncodeHaibTfbsHepg2Hnf4asc8987V0416101RawRep1.bigWig -TF_name HNF4A -genome data/hg19/hg19 -output results/HNF4A

gives the following error:

Running Iterative SEM building..
    PWM: examples/MA0114.1.transfac
    merge_file: examples/wgEncodeOpenChromDnaseHepg2Pk.narrowPeak
    bigwig: examples/wgEncodeHaibTfbsHepg2Hnf4asc8987V0416101RawRep1.bigWig
    TF_name: HNF4A
    genome_file: data/hg19/hg19
     output: results/HNF4A
    PWM
Estimated ideal kmer count (500k hits): 0
---Iteration 0---
    PWM
iterativeSEM: src/get_threshold.cpp:23: double get_threshold(Dataset&, double): Assertion `data.TFM_data.letter_array[0].size() > 0' failed.
Aborted

This is obviously an issue with the motif file as it runs fine when using the supplied MA0114.1.pwm file.
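For reference, here is a sketch of the kind of post-processing that might bridge the two files. It assumes the JASPAR download has a header line starting with PO followed by one row per position (a position label, then counts for A, C, G, T) and that the supplied .pwm uses a DE line, rows ending in a placeholder X, and a final XX. The function name, the tab separator, and the rounding to whole numbers are all assumptions; whether this is sufficient for SEMpl is not confirmed here.

```python
def jaspar_transfac_to_example_layout(text, tf_name, matrix_id, sep="\t"):
    """Rewrite a JASPAR-style TRANSFAC matrix in the layout of MA0114.1.pwm.

    Assumed input: a 'PO' header line, then rows of
    '<position> <A> <C> <G> <T>' until a non-numeric tag such as 'XX'.
    """
    rows = []
    in_matrix = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("PO", "P0"):
            in_matrix = True  # matrix rows start on the next line
            continue
        if in_matrix:
            if not fields[0].isdigit():
                break  # 'XX' or any other tag ends the matrix
            # Counts may be written as 28.0; the example file uses 28.
            rows.append([round(float(v)) for v in fields[1:5]])

    out = [sep.join(["DE", tf_name, matrix_id])]
    for position, counts in enumerate(rows):  # positions renumbered from 0
        out.append(sep.join([str(position)] + [str(c) for c in counts] + ["X"]))
    out.append("XX")
    return "\n".join(out) + "\n"
```

The separator is a parameter because the original whitespace of the example file is not visible in this thread.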

annahutch commented 3 years ago

Specifically, I want to use the motif for IKZF1. The correct version is on HOCOMOCO (a different motif is reported on JASPAR for some reason):

https://hocomoco11.autosome.ru/motif/IKZF1_HUMAN.H11MO.0.C

I've even copied and pasted the values in the PWM at the bottom of the page into a file with exactly the same formatting as MA0114.1.pwm in the example, but no luck. The error reads:

    PWM
0   -1  0   0   0   
Estimated ideal kmer count (500k hits): 0
---Iteration 0---
    PWM
0   -1  0   0   0   
row size 4 col size 1
-1  
0   
0   
0   
*** Error in `./iterativeSEM': free(): invalid next size (normal): 0x0000000001bf4f20 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81299)[0x2b48af3c7299]
./iterativeSEM[0x42914c]
./iterativeSEM[0x408669]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b48af368555]
./iterativeSEM[0x40a8cf]

aboyle commented 3 years ago

Yes, you'll need to convert the PCM from HOCOMOCO to TRANSFAC format. I did it for you here:

IKZF1_HUMAN.H11MO.0.C.pcm.zip

annahutch commented 3 years ago

Unfortunately this does not work, see below for the error:

*** Error in `./iterativeSEM': malloc(): memory corruption (fast): 0x0000000001ab7740 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7f3e4)[0x2b8b60f9e3e4]
/lib64/libc.so.6(+0x82b20)[0x2b8b60fa1b20]
/lib64/libc.so.6(__libc_malloc+0x4c)[0x2b8b60fa46fc]
/usr/local/software/archive/linux-scientific7-x86_64/gcc-9/gcc-9.3.0-qszxcci5frtw4aul3m44oarpvxzyrgpp/lib64/libstdc++.so.6(_Znwm+0x15)[0x2b8b604b45e5]
/home/ah2011/rds/hpc-work/SEMpl/lib/libTFMpvalue.so(_ZN6Matrix21computesIntegerMatrixEdb+0x356)[0x2b8b5fdf8056]
./iterativeSEM[0x428f6c]
./iterativeSEM[0x408669]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b8b60f41555]
./iterativeSEM[0x40a8cf]

aboyle commented 3 years ago

Ah, I think the values will all need to be integers as well.
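Something like the following could do the conversion and the rounding in one step. It is a rough sketch, not the script used for the attached file: it assumes the HOCOMOCO PCM is a ">" header line followed by one line per position with four counts in A, C, G, T order, and it writes the same layout as the MA0114.1 example. The function name and the tab separator are assumptions.

```python
def hocomoco_pcm_to_example_layout(pcm_text, tf_name, sep="\t"):
    """Convert a HOCOMOCO-style PCM to the layout of the example PWM.

    Assumed input: a '>' header line, then one line per position holding
    four counts (A, C, G, T). Counts are rounded to whole numbers.
    """
    lines = [ln.strip() for ln in pcm_text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a '>' header line")

    header = lines[0][1:].split()
    motif_id = header[0] if header else tf_name  # fall back if header is bare

    out = [sep.join(["DE", tf_name, motif_id])]
    for position, line in enumerate(lines[1:]):
        values = line.split()
        if len(values) != 4:
            raise ValueError(f"position {position}: expected 4 counts")
        counts = [str(round(float(v))) for v in values]  # integers only
        out.append(sep.join([str(position), *counts, "X"]))
    out.append("XX")
    return "\n".join(out) + "\n"
```

Rounding changes the counts slightly, so it is worth checking that no position ends up with all zeros.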

annahutch commented 3 years ago

Many thanks - it now works.