FASTA input files for Garnet

agitter commented 5 years ago

I'm summarizing an issue that @omigueles reported via email. I'm not familiar enough with the Garnet code to resolve it myself. @sgosline suggested it may be related to TAMO.

@zfrenchee do you know who in the Fraenkel lab is actively working with Garnet and may be able to help @omigueles?

A description of the issue from @omigueles:

I was exploring the fasta files in the A549 example and I got a couple of doubts, I would really appreciate if you could help me.

In the example file the fasta headers look like this:

>hg19_chr1_4709051_4709736_+ -1

Playing around, first I eliminated the "space-1" at the end, which causes the program to crash, which I understand since the program expects a header from Galaxy. I eliminated it because I wanted to understand what it means, is it the strand? I thought it was defined as + in this specific header. Besides in the bedfile it looks as if the strand was actually not known...
If I change the "space-1" to "space." ( .) which should be the standard when you do not know the strand, I get completely different results for this example (33 TFs instead of 75 TFs) do you know why this could be happening?
If I try with a toy example in Galaxy I see that the fasta header looks like this (even if you have a "point" in the strand field of the bed file):
```
>chr1:2-30 (+)
```
I completely understand that this could be caused by the Galaxy version used, and I can change the headers so that they look as in the example, but then what do I need to put in my header so that Garnet works properly? i.e. Would the next statements be correct?

if I know the strand :

hg19_chr1_2_30_+ +1
hg19_chr1_2_30_- -1

and if I do not know it:

hg19_chr1_2_30_+ .

Thank you for your time and your attention.

(next update:)

I re-checked Galaxy and actually you can get the same format for your header if you choose "character delimited field values" writing "_" . Interestingly the last piece of information is the fourth field of the BED file which is "name". I was also looking through some other scripts in the OmicsIntegrator website and found that in https://github.com/fraenkel-lab/OmicsIntegrator/blob/master/src/chipsequtil/Fasta.py the field after the whitespace in the header is used as Fasta ID. Could it be something like that? The program might be expecting unique IDs and by giving "-1" or "." all the time perhaps is doing something odd.

(next update:)

But my question is the same, how does the "-1" affect the output and why?

To begin with that should not be a "-1" , according to the Galaxy example it should be "peak1", maybe it is a bug in Galaxy. Still the thing is that if I run the example with this "-1" I get 75 TFs just like you do, according to:

https://github.com/fraenkel-lab/OmicsIntegrator/tree/master/tests/integration_test_standard/events_to_genes_with_motifsregression_results_FOREST_INPUT.tsv

but if I put a "." (as in "no name" in the bed file) I get 33 TFs and if I replace "-1" in every header with the "name" field from the bed file, that is "peak1,peak2... peak3108", I get 105 TFs.

This behavior is with OmicsIntegrator at commit 28d9a75, the v0.3.1 release.

One partial resolution is to update our readme to clarify that character delimited field values is required in Galaxy.

I've started to look at the Garnet code to see how the FASTA headers are used. Two instances are: https://github.com/fraenkel-lab/OmicsIntegrator/blob/08b55483b18aafa7e854c36f4920a98719c03332/scripts/get_window_binding_matrix.py#L23 where key_func is the identity function so the entire header is used as the id instead of only the second whitespace delimited token.

and https://github.com/fraenkel-lab/OmicsIntegrator/blob/08b55483b18aafa7e854c36f4920a98719c03332/scripts/get_window_binding_matrix.py#L39-L42 where it looks like the name for the sequence is not being used.

alexlenail commented 5 years ago

@agitter Sorry to say I don't believe any of us are using GarNet anymore. @iamjli might know more than me.

agitter commented 5 years ago

@sgosline do you have any ideas about where to look next in the Garnet code? I could help debug, but I haven't read through many of the Garnet and chipsequtil files before.

sgosline commented 5 years ago

Yeah, the Fastq files are read in by the old chip seq util code here. I believe it was last updated by Adam Labadorf but I'm not sure. Is this being maintained at all? I haven't been using Garnet in my research.

iamjli commented 5 years ago

Hi all, I have a strong feeling that this issue has something to do with how values are stored in dictionaries. When these dicts are keyed by different values (ie. hg19_chr1_4709051_4709736_+ -1 or hg19_chr1_4709051_4709736_+ .), the order of dict.values() changes.

Could someone familiar with the codebase follow up on this? @sgosline @agitter

agitter commented 5 years ago

@iamjli interesting observation. I'm not familiar with the Garnet codebase, but I'll note that dictionary ordering also caused test case failures in the Forest code long ago. In particular, msgsteiner was sensitive to the order in which information was presented.

iamjli commented 5 years ago

Is anyone planning to debug this? Otherwise we may need to deprecate this code.

sgosline commented 5 years ago

That might be wise, I don't have the bandwidth to debug until mid-November.

agitter commented 5 years ago

Is there an alternative workflow that the Fraenkel lab is using to predict transcription factor activities from epigenomic and transcriptional data? If we could offer a replacement, I would be more supportive of deprecating the Garnet code.

If we don't have a replacement, then we would be removing significant functionality only 2 years after the PLOS Comp Bio paper appeared and the same year as @AmandaKedaigle's protocol paper. In that case, I would prefer to wait until @sgosline has time to assess how difficult the debugging would be or recruit someone new to help maintain the code.

sgosline commented 5 years ago

Hi all, I'm finally getting to this now as I'm sitting in a lot of talks/planes this week. Sorry for the delay. I'm having an issue getting the code, created #32 to track. It might be my config or the fact that I'm in Europe, not sure.

sgosline commented 5 years ago

I'm still traveling but have narrowed the problem down to the scoring of the fasta files in motif_fsa_scores.py which Anthony Soltis wrote. I think it's either a dictionary key ordering issue or a threading issue. He's allegedly attending the meeting I'm traveling to so will see if he knows.

agitter commented 5 years ago

Thanks for the update @sgosline, that's encouraging.

If we need to, I would support making a behavior-breaking change that orders the dict elements as long as we bump the version number. I'd rather err on the side of having a stable implementation than matching past behavior given our limited time for maintenance. I'm not sure whether OrderedDict would be relevant here or if we should just sort the keys, which is a change we made several places in the Forest code to fix dict iteration inconsistencies.

sgosline commented 5 years ago

Who is administrator for this github repo? @zfrenchee ? Can you add Anthony Soltis to this project so he can participate?

alexlenail commented 5 years ago

He's invited. Anthony, check out: https://github.com/fraenkel-lab/OmicsIntegrator/invitations

On Thu, Nov 8, 2018 at 8:58 AM Sara JC Gosline notifications@github.com wrote:

Who is administrator for this github repo? @zfrenchee https://github.com/zfrenchee ? Can you add Anthony Soltis to this project so he can participate?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/fraenkel-lab/OmicsIntegrator/issues/31#issuecomment-437002217, or mute the thread https://github.com/notifications/unsubscribe-auth/ACojfR3NrEAwSbsu81nu0PDSQX2_5YlLks5utDh0gaJpZM4WY0Rp .

-- Alex Lenail Tufts University '16

fraenkel-lab / OmicsIntegrator

FASTA input files for Garnet #31