SchulzLab / STARE

TF analysis from epigenetic and Hi-C data
MIT License
16 stars 2 forks source link

Genes not present in tfbs enrichment output #4

Closed Andi964 closed 1 year ago

Andi964 commented 1 year ago

Hello!

I ran now the full STARE pipeline on my dataset and generated reasonable ABC-scoring results. I played a little bit with gtf input and discovered, that some genes are not in the tfbs enrichment file and also not in the corresponding discarded genes file. However if i limit STARE to those genes (-u), I get them in my output file. Furthermore I discovered that I ALWAYS get several "thousands" (e.g. 10000, 60000, 3000) of output rows in the tfbs enrichment file, independent of different sets of input gtf files. Could it be that the code somewhere smooths the results down to "X-thousand" lines, thereby skipping some genes? If yes...how to solve it?

Best Andreas

DennisHeck commented 1 year ago

Hey, That is odd. There should not be any fixed number of lines. For summarising the TF affinities what the script does is go through each line in the gtf-file and see if the 3rd column says 'gene'. Then it checks if there was any set of genes defined via -u and skips genes that are not found in there. That's why genes should theoretically not appear in the output only when given via -u. From the remaining set of genes it looks for peaks that can be mapped to each gene, either via a defined window or preceding ABC-scoring, and should write a line to output for each. Those genes without any mappable peaks should be in the discarded genes file. So far, we always used the gtf-files from GENCODE, maybe there are others where the format differs. Another issue could be non-unique gene-names or IDs. Could you maybe provide the first few lines of your gtf-file and the -u file? And the command how you used STARE? Best, Dennis

Andi964 commented 1 year ago

GTF (from Gencode): gtf_short.txt

Here my code: STARE.sh -b ../../data/abc_act/EnhancerList_act.bed -n 4 -a ../../data/reference/gencode.v43.annotation.gtf -o ../../data/adabc_0.02_pred/adabc_0.02 -f ../../data/hic/hic_adabc -g ../../data/reference/hg38.fa -s ../../data/jaspar/jaspar_2022.transfac.txt -k 5000 -t 0.02 -c 8

In the -u file I just put a single gene name like "GAPDH".

Furthermore, I tried to read a little bit through ur code and it seems like you define batch sizes of 1000 in line 651 of: TF_Gene_Scorer_Reshape.cpp

Best Andreas

DennisHeck commented 1 year ago

The batch size of 1000 is for output writing only, meaning the data for 1000 genes are collected and then written to output in one go. I hope there are no corner cases where genes are skipped there, but that wouldn't explain the issue here. I tested small examples but unfortunately couldn't reproduce what you describe. I took the v43 gencode annotation, ran STARE both with and without ABC-scoring (using the contact estimate based on distance), while using a bed-file with two regions: chr12 6534512 6534912 10 chr12 6534012 6534412 20 And for me it seems to work fine. I get the genes whose window overlaps the peaks in both the ABC-scoring files as well as the TF-Affinity file, and when giving GAPDH as -u, I only get an entry for that. Also, when giving GAPDH via -u, the discarded genes file is empty, as all filtered genes are present in the output. The non-ABC version: STARE.sh -b Peak_Test.txt -n 4 -a gencode.v43.annotation.gtf.gz -o WoABC_GAPDH -s PWMs/2.2/Jaspar_Hocomoco_Kellis_human_transfac.txt -g hg38.fa -u GAPDH_Test.txt And the ABC-version: STARE.sh -b Peak_Test.txt -n 4 -a gencode.v43.annotation.gtf.gz -o WABC_GAPDH -s PWMs/2.2/Jaspar_Hocomoco_Kellis_human_transfac.txt -g hg38.fa -f False -t 0.02 -k 5000 -u GAPDH_Test.txt

Could you try to reproduce these tests? Just to be sure, you're using the most recent release? Were you able to successfully run the test the test cases ?

Andi964 commented 1 year ago

Yes the test cases work fine. However the problem ONLY occurs if there are more than 1000 output rows... So with small test files one can not reproduce this issue. Only by using e.g. the whole chr1 (error occurs also in this case --> The first X-Hundred genes on this chromosome are missing) or the whole genome.

The ABC score results contain the "missing" genes and also nearby and strong peaks are present. Actually the problem looks to me like if there is an issue in TF_Gene_Scorer_Reshape.cpp (cutting to thousands), since all other steps but this (which is the last one) work. Furthermore the -u outcome indicates to me that All TFBS calculations for those genes are actually there and also correct --> It seems to me rly like an issue directly before/at writing the output file.

My code is up to date and downloaded from conda

Best Andreas

DennisHeck commented 1 year ago

I found the issue, seems kind of obvious to me now. It really were the batches when writing to output. The code is updated, and once GitHub lets me publish a new release, it should also be updated in a new bioconda version. Thanks for raising the issue and helping to pin it down! Best wishes, Dennis

Andi964 commented 1 year ago

Many thanks for your quick help :) Do you have any idea when the updated code will be available ?

Best Andreas

DennisHeck commented 1 year ago

With bioconda it can sometimes take up to a week, as the recipe has to be manually approved. The fastest for usage would be to clone the repository and compile it manually, but that can of course be quite a hustle with the dependencies.

Andi964 commented 1 year ago

Many thanks for the information! I will keep u up to date if the issue is solved by your fix :)

Best Andreas

DennisHeck commented 1 year ago

Hey Andreas, The new version is now also available via bioconda. Thanks for being so attentive with the results! Let me know if you encounter any other issue. Best wishes, Dennis