Speed improvements to Garnet map_peaks_to_genes.py script

Mkebede commented 7 years ago

Added a Balanced Binary Search tree data structure to make searching for genes within peaks easier.

AmandaKedaigle commented 7 years ago

Hey @agitter, @Mkebede is an undergrad in our lab so I've been working with her to make these improvements. We have tested them to make sure they produce the same results on our example data and some of my own data, but I would love to have a more established way to test changes to Garnet code, especially since she's planning to make more improvements in the coming months as well. @sgosline can you think of anything specific we should watch out for when designing tests, or do you think running it on example data (a549?) is sufficient?

By the way, here is my data on the speed improvements of this version for posterity. On our example a549 data, running on our lab cluster, the current version took 1h:12m:31s, while this version took 34m:52s. On my own data, the current version took 2m:2.6s, while this version took 18s.

agitter commented 7 years ago

@AmandaKedaigle thanks for the context. I first though this was a contribution from a benevolent stranger so I was treating it with gratitude and caution. No need to be so cautious now.

This might still be a good time to add our first formal Garnet test case. I have enough experience with the Forest testing that I can volunteer to write the test code. I would just need the python map_peaks_to_known_genes.py ... command to use and the reference output that we want to test against.

The complication is that I believe Travis CI has a time limit for the test cases, and the a549 Garnet run would exceed that limit. Are there other public datasets in the lab that we could use for a quick test case? Or can we truncate the a549 data?

AmandaKedaigle commented 7 years ago

That would be great, @agitter. I would recommend running through garnet.py, not just map_peaks_to_genes, if possible? Future improvements will be in other scripts that are called from garnet.py.

I just checked out the other example files we provide with OmicsI, and it looks like all of them with garnet files have an even larger genomic region bed files than a549 (the biggest time determinant). There are certainly plenty of new example datasets we could add, but maybe truncating the a549 genomic regions file is the way to go for simplicity.

For comparison, wgEncodeUWDukeDnaseA549.fdr01peaks.hg19.bed in the a549 folder has over 170,000 genomic regions to scan, while my own data that took 18sec to process had about 5,000.

agitter commented 7 years ago

I can just as easily run through all of garnet.py and compare all of the intermediate output files produced along the way. I initially suggested just one step because it would be easier to squeeze in the Travis CI time limit.

Should we use a ChIP-seq peak bed file for the test case? That should have far fewer genomic regions and would also serve as a positive control. If we don't get back that TF as the highest scoring TF from Garnet, we're in trouble.

AmandaKedaigle commented 7 years ago

Sounds like a good idea to me! To try it, I ran both new and old versions with file ENCFF000MZR from here: https://www.encodeproject.org/experiments/ENCSR000BPX/ with the same a549 expression data that's already in the example file. I picked FOXA1 because of this paper: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-14-680 and it has a small number of peaks. FOXA1 is not the highest scoring, but it's pretty far up there, and both full versions of Garnet took under 2 minutes to run.

agitter commented 7 years ago

That's perfect. It looks like the peak file isn't very large in terms of disk space, which is also good news.

Let's use the new test-garnet branch I made (3bbb94e4588ef0d546efd1d0bec2f4c1fe51cb5d) to formally establish this test case and then merge this pull request (after figuring out the third party license acknowledgement). In test-garnet can you please:

Add the FOXA1 ChIP-seq bed and fasta.gz files to example/a549
Add the Garnet config file you used to example/a549
Add all of the Garnet output files to tests/integration_test_standard. I could also run them myself given the input data and config file, but if you add them it will be a good test that I recover the same results on different hardware. That was a problem with Forest.

Once I have the input and reference output files, I'll write the test code.

AmandaKedaigle commented 7 years ago

Great, should be posted!

agitter commented 7 years ago

That looks great. I should be able to write the test case this week.

agitter commented 7 years ago

The Garnet integration test is ready, but it does not pass. I have ideas for how to fix the two problems I described in b2ddb36ccde69d2d473bdf45f80fbc8784bcd6fb though.

agitter commented 7 years ago

Hi @Mkebede and @AmandaKedaigle, should we finish off this pull request? I think the only minor changes we need are the copyright and license notifications I listed here https://github.com/fraenkel-lab/OmicsIntegrator/pull/24#discussion_r91755532

Mkebede commented 7 years ago

Hi @agitter and @AmandaKedaigle , I think I have committed all of the changes. Sorry for the delay and thank you for your patience.

Happy New Year Everyone!

agitter commented 7 years ago

@Mkebede Thanks, I'll check this out today or tomorrow and merge the pull request.

agitter commented 7 years ago

Hi @Mkebede I reviewed the changes. Thanks for adding the license notices. Can you please convert LICENSE-3RD-PARTY.pages to a plain text .txt or markdown .md file? It can be annoying to open .pages in Windows and GitHub can't preview it.

Also, the new Garnet test case failed. Details here indicate that events_to_genes.xls is different than it was before the AVL tree update. The /tests/integration_test_standard contains the reference version of events_to_genes.xls that you can use to assess what changed. We may need to relax the test case and not require a perfect file match, depending on what the differences are.

Mkebede commented 7 years ago

@agitter I recently got a Mac and am still getting used to the file formats, but I think I have successfully changed it to a .txt file now. Thanks for the catch.

As for the Garnet test case I think there was a problem with duplications in the original file. Essentially because the original method to find genes within peak regions went through each peak and mapped it to the genes within its range, it was able to map multiple peaks to the same gene. However, since we ultimately only cared about the genes - and not about the peaks they mapped to - I removed the duplicates in the modified map_peaks_to_genes.py file. I had written a python script that made a set of all of the genes that were found in the original approach - saved as an excel file on my pc - and the genes from the new approach - also saved on my pc. Then I compared the two sets using the built in difference() function for two sets and found that they were the same.

I could be wrong in assuming that we could use sets of just the geneIDs to compare the outputs, but otherwise I think it should work. I don't have much experience with the GitHub testing platform, so if you could point me to resources on how to navigate through the errors that are being produced then I can look into it further.

agitter commented 7 years ago

Hi @Mkebede thanks for changing the file format. Can you please commit the new .txt version when you have a chance?

I am not very familiar with this part of the code, if you and @AmandaKedaigle agree that the old version of events_to_genes.xls contained duplicate information, then we should switch the reference version of that file with this pull request. That way, future commits will run the test case against that new version and confirm we still get the expected output.

To overwrite the reference version, you can please run Garnet on your local machine using example/a549/tgfb_foxa1_garnet.cfg as the input. Then take the version of events_to_genes.xls that is output and use it to overwrite /tests/integration_test_standard/events_to_genes.xls. Commit that updated file in this branch. That's all that's needed to update the test case. Then when you commit again, Travis CI will automatically confirm that it generates a version of events_to_genes.xls that exactly matches yours.

@AmandaKedaigle do you know if this change to events_to_genes.xls affects our documentation anywhere? If so, we should update that as well.

AmandaKedaigle commented 7 years ago

Hmm, @Mkebede I just ran some checks, and it actually looks like your new script is the one that creates some duplicates, rather than the older version. However, since the duplicated lines won't affect the function of Garnet, I think this is fine. We can trade the speed upgrade for some duplicated lines in events_to_genes.xls (Another member of our lab is working an even more sweeping update of Garnet, including this script, so I think our time is better spent there rather than chasing this down). I just updated the version of that file in the test directory so that tests will pass, and I don't believe there needs to be any documentation upgrades. It looks like she did commit the .txt file already, so unless there's any more objections, @agitter or @Mkebede, I'll merge this PR tomorrow?

agitter commented 7 years ago

@AmandaKedaigle Sounds good to me, thanks for checking the differences in the output files. If you commit the new version of events_to_genes.xls in this branch then Travis CI will run the tests and confirm they pass before we merge. If you're confident it will pass, then it isn't a big deal.

Either way I agree we're ready to merge if you have the updated events_to_genes.xls file.

AmandaKedaigle commented 7 years ago

Thanks so much @agitter and @Mkebede! So glad to get this upgrade and test case in 👍

The tests are passing, but the main page still has a "build failing" label on it. Maybe it will change after some update time has passed, but if not, do you know how to fix it?

agitter commented 7 years ago

Awesome! Thank you both for the new features and updated test case.

The build icon must have been a caching problem. It shows build passing when I checked now and the test log shows that both integration tests (Forest and Garnet) are being run in test_integration.py.

agitter commented 7 years ago

I updated HISTORY.rst with these changes. Should we release Omics Integrator version 0.3.1? This pull requested added a fairly major new feature.

AmandaKedaigle commented 7 years ago

Sounds like a good idea to me!

agitter commented 7 years ago

The last release I attempted had some errors related to end of line characters, but I'll try again if no one else gets to it first. It may take me a while.

AmandaKedaigle commented 7 years ago

I compressed and updated the files in dist, and referenced 0.3.1 in README and HISTORY, let me know if there is anything we need to do to update versions

agitter commented 7 years ago

Thanks @AmandaKedaigle. I wasn't able to pull the changes locally due to our LFS issues but everything looks perfect on GitHub. I made a new tag and release.

Could you please also update the 0.2.0 reference and 0.3.0 link on the lab page http://fraenkel-nsf.csbi.mit.edu/omicsintegrator/ ?

Oddly, our tests failed when you updated the docs. It looks like a problem with Travis CI, not our code, so I'm going to ignore it.

AmandaKedaigle commented 7 years ago

Sounds good, website updated. Thanks!

fraenkel-lab / OmicsIntegrator

Speed improvements to Garnet map_peaks_to_genes.py script #24