cgat-developers / cgat-apps

cgat-apps repository
Other
33 stars 14 forks source link

Ac refactor #122

Closed Acribbs closed 6 months ago

Acribbs commented 6 months ago

@IanSudbery @sebastian-luna-valero @david-cgat @AndreasHeger @jscaber and everyone.. I realised today that pandas and numpy has features deprecated that caused our bam2bam to fail (plus other scripts). While fixing these issues I realised how bloated the cgat-apps is and have made an effort to reduce some code (Starting with scripts) so it can be better maintained. Im going to merge but please take a look at scripts that I have removed and if there are any that you think should be returned then let me know and I can revert some commits.

IanSudbery commented 6 months ago

I'm not against retiring old, unused scripts, but I'm not sure just pulling them without any process or warning is a good way forwards.

I suggest we instigate some sort of life cycle policy. Mark tools with a deprecation warning some period before they are removed. We should probably also leave a stub in that says the tool has been removed when called, rather than just throwing an error.

I can't speak to all the things removed here, but I know that we use bam2peakshape, which is useful for viewing any continuous signal over discrete regions. We've used it with ATAC-seq, TT-seq, mNET-seq just off the top of my head. Its also the example that we use in the paper to demonstrate the capabilities of the tool set.

I'd also like to speak up for combine_tables, which I've used a lot in the past. I wouldn't assume that pandas is more efficient, as pandas is often very inefficient when it comes to memory usage. I also believe that in --cat mode (in in how it is/was used in P.concatenateAndLoad), there is no reason why combine_tables shouldn't use effectively zero memory. The final advantage of these sorts of things, is while it might be true that there are now good ways of achieving the same thing in python (or R), sometimes you just want something you can throw in a bash statement without writing a new python script.

That said, I don't know how much anything is used these days, as I personally don't do much actual coding, and people in my lab find discovery a problem (where as most of this stuff is just in my head).

Acribbs commented 6 months ago

Happy to re-introduce anything people find useful, I just dont have too much time to fix the updates that come from maintaining all the code now (Although I still use cgat-apps a lot), so reducing as much as possible would be best. It took a while to get the code up to standard to support python >3.9 today, lots of fixes related to updates in numpy and pandas.

The reason for removing the bam2peakshape was because I thought that it was more ChIP-seq pipeline specific, but will revert as I wasn't sure if its used generally.

Regarding P.concatenateAndLoad, this should be standalone code now in cgat-core and it should be completely independent of cgat-apps, should be called P.concatenate_and_load now I think.

Acribbs commented 6 months ago

The worrying thing was that we have a python >3.8 build on conda and it is likely broken because the code hasnt been maintained for a while. It was only today that I decided to give later versions of python (I usually pin to py 3.8) a try and found them broken. Then I went down a rabbit hole of trying to fix everything.

IanSudbery commented 6 months ago

I completely get where you are coming from! I'd love to say that I would put effort into maintaining them, but I know I'd never follow though.

And I am sure there is lots of deadwood in there.

I just feel like we should have some sort of defined process for removing things that are no longer used by anyone.

When you say broken, do you mean they error, or that they no longer pass the regession tests?

Acribbs commented 6 months ago

Ok see your point, from now on for scripts I can initiate a deprecation cycle.

For scripts that you think are useful that I removed these can easily be reversed as i made sure i made one commit for each script removal. Or if you have pipelines that break, use a previous pin to conda for time being.

The code was broken for any python version >3.8 because of changes to pandas and numpy libs. We only tested on github actions up to py3.8 so none of this was caught (I now test up to py10, but think we should probably build and test for up to py3.12 too). bioconda had a build for python 3.10 released for our old code and would break if used. The tests on bioconda are basic so this needs to be picked up in our testing. Thanksfully, bioconda blacklisted cgat-apps anyway because the builds were consuming too much RAM, because of lots of high dependancies. I did a code review of the other bits yesterday and have realised that there is so much code that just does nothing. I suspect that if I dont clean that up the code then we will get blacklisted again and its a pain to manually keep releasing cgat-apps recipe.

Acribbs commented 6 months ago

Example deprecation warning given in #124, this ok?

IanSudbery commented 6 months ago

I think the deprecation messages are fine. I would roll a release before the deprecation goes ahead. Kind of a "this is the last release that will contain these functions".

We should probably keep a table of what was deprecated and when.

When we do the deprecation, we should probably replace the removed scripts/code with stubs that just print a deprecation message that then raise an error (your warnings could be replaced with errors for e.g., and the code itself just replaced with pass), for, say, a year.

We definately don't want to be black listed on bioconda, so we need to do what it takes for that not to happen.

What do you mean by "does nothing"?

You mean functions that are not called elsewhere in cgat-apps or in the cgat-flow, or functions that literally just return their parameters unalter etc?

Acribbs commented 6 months ago

Good idea regarding the table and can create one in the repo. Going to work on the code changes now and see if I can remove some dependancies. already found a few such as futures, jinja2 and six.

Some code such as https://github.com/cgat-developers/cgat-apps/blob/c931a5d6d0a3f870fd849a1ee692028f784096b6/cgat/CBioPortal.py#L1, https://github.com/cgat-developers/cgat-apps/blob/c931a5d6d0a3f870fd849a1ee692028f784096b6/cgat/IGV.py#L1, https://github.com/cgat-developers/cgat-apps/blob/c931a5d6d0a3f870fd849a1ee692028f784096b6/cgat/SVGdraw.py#L1 all completely redundant and the last one hasnt had a major update since 11 years ago. Its code that isn't used anywhere and doesnt seem to have any useful general functionality.

There are other bits of code that I am unsure if are useful, like this: https://github.com/cgat-developers/cgat-apps/blob/c931a5d6d0a3f870fd849a1ee692028f784096b6/cgat/MEME.py#L4. You updated last in 2020 so I assume its used in some capacity, but doesnt seems to be used in any scripts.

Acribbs commented 6 months ago

Also various files such as install.sh (broken anyway), PKG-INFO, COPYING (not needed as have license) seem to be redundant. And then there is the documentation, which is a whole other story of messiness which would take a while to fix, the latest read the docs is broken so would need to fix that first, or host them on github.

IanSudbery commented 6 months ago

Yes, I think it depends how you use cgat-apps.

For me personally, I always used cgat-apps as a programming API as well as a collection of scripts. I probably import modules from cgat more often than I run the scripts (not that I do either much these days) - the GTF and Bed modules in particular are absolutely gold, and remarkably provide functionality I'm not aware of existing anywhere else.

Some of these things (like CBioportal and MEME) could conceivably be standalone packages - they are not in cgat-apps because they are used in either the scripts or the pipelines (although I think MEME is in the motifs pipeline, if that still exists?), but because they provide functionality to be used externally in their own right (We've used it in several pipelines). The other two I'm not sure about, but I suspect the story is similar. IGV.py I'm pretty sure was used in one of Andreas' original project pipelines - back in the day when "cgat" was just a directory on /ifs where everyone put all their code, and was included in cgat-apps because, well, someone might find it useful.

You are right about the documentation, and this is a real barrier to anyone who doesn't know the code intimately from using cgat-apps. I'm pretty sure that those two modules above are entirely undocumented, so there is not much point in them existing (except that I still remember they exist).