cgat-developers / cgat-flow


{is} rnaseqqc repairs #148

Closed IanSudbery closed 2 years ago

IanSudbery commented 2 years ago

A collection of small changes necessary to get pipeline_rnaseqqc running on our system.

One of these changes is breaking and requires rpy2 version >3


Alters cgatpipelines.tasks.mapping.SubsetHeads so that it uses NR<= rather than NR< in the subsetting code, because NR< was dropping the final line from FASTA records.

This is still very slow on our systems. I'm not sure why, but it feels like we should be able to do better.

See issue #145
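To illustrate the off-by-one, here is a minimal Python sketch of the same comparison. It is not the SubsetHeads code itself: the function names and the fixed `lines_per_record` value are assumptions made for illustration, and lines are numbered from 1 to mirror awk's NR.

```python
def subset_head(lines, n_records, lines_per_record=4):
    """Keep the first n_records records (hypothetical helper).

    Lines are counted from 1, as awk's NR does, so the inclusive
    comparison keeps the last line of the final record.
    """
    limit = n_records * lines_per_record
    return [line for nr, line in enumerate(lines, start=1) if nr <= limit]


def subset_head_exclusive(lines, n_records, lines_per_record=4):
    """Same as above but with the old exclusive comparison (NR < limit)."""
    limit = n_records * lines_per_record
    return [line for nr, line in enumerate(lines, start=1) if nr < limit]


# Three four-line records; lines_per_record=4 is an assumption.
example = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "TTTT", "+", "IIII",
    "@r3", "GGGG", "+", "IIII",
]

complete = subset_head(example, 2)             # 8 lines, two whole records
truncated = subset_head_exclusive(example, 2)  # 7 lines, last record cut short
```

With the exclusive comparison the second record loses its final line, which is the symptom described above.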


Adds r-rjson to the conda env, as required by rnaseqqc (see issue #147).


Switches plotExpression from rpy2/ggplot to matplotlib. This fixes problems arising from changes in rpy2 by avoiding rpy2 altogether, so this patch will make the pipeline work with both old and new rpy2.


Salmon compatibility changes

Changes to make rnaseqqc compatible with salmon > 1.0, mostly in the location and format of metadata, reports and logs. Removes the specification for a separate salmon environment.


fasta2table section specification

Changes the call to fasta2table so that section names are separated with a space rather than ;. This is necessary because of the change from optparse to argparse, but it is unclear whether it is compatible with galaxy.

Also needed to remove =.

See issue #144
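As a rough sketch of why both the separator and the = had to change, the snippet below uses a stand-in argparse parser. The `--sections` option name and the section values are hypothetical, not fasta2table's real interface. It assumes the option is declared with `nargs="+"`, in which case argparse collects space-separated values, while the `--option=value` form binds only the single value attached to the `=`.

```python
import argparse


def build_parser():
    """Hypothetical stand-in for a parser that takes several section names."""
    parser = argparse.ArgumentParser(description="illustrative parser only")
    # nargs="+" collects one or more space-separated values into a list.
    parser.add_argument("--sections", nargs="+", default=[])
    return parser


parser = build_parser()

# Space-separated values, no "=": every name ends up in the list.
spaced = parser.parse_args(["--sections", "length", "na"])

# With "=", only the attached value is bound; the rest is left over.
with_equals, leftover = parser.parse_known_args(["--sections=length", "na"])

# An optparse-style ";"-joined string arrives as a single, unsplit value.
joined = parser.parse_args(["--sections", "length;na"])
```

Here `spaced.sections` holds both names, whereas the `=` form leaves `na` unconsumed and the `;` form yields one combined string.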


Changes for compatibility with modern pandas


Setting xlim in plotStrandednessSalmon

Changes the code for setting the x limits and tick locations in plotStrandednessSalmon. Previously this relied on matplotlib automatically placing the upper x limit in exactly the correct place, but this was often wrong.
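The general idea of setting limits and ticks explicitly, rather than trusting autoscaling, can be sketched as follows. This is not the code from plotStrandednessSalmon: the helper names, the 0 to 1 range and the tick count are assumptions for illustration. `set_xlim` and `set_xticks` are the standard matplotlib Axes methods.

```python
def even_ticks(lower, upper, n_ticks):
    """Return n_ticks evenly spaced tick positions from lower to upper."""
    step = (upper - lower) / (n_ticks - 1)
    return [lower + i * step for i in range(n_ticks)]


def fix_x_axis(ax, lower=0.0, upper=1.0, n_ticks=5):
    """Pin the x limits and tick locations of a matplotlib Axes.

    The default range and tick count are illustrative assumptions.
    """
    ticks = even_ticks(lower, upper, n_ticks)
    ax.set_xlim(lower, upper)  # do not rely on autoscaling for the upper limit
    ax.set_xticks(ticks)       # ticks are derived from the same limits
    return ticks
```

Deriving the ticks from the same two numbers passed to `set_xlim` keeps them consistent however the data happen to be distributed.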


rpy2 >3.0 changes

This change makes rnaseqqc compatible with rpy2 3.1. That probably means that it will make it incompatible with rpy2 <3. A decision will need to be made whether to support this.
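If it is decided to support only newer rpy2, one option is to fail early with a clear message. The sketch below is a suggestion, not part of this PR: the function names and the major-version threshold of 3 are assumptions, and it uses only `importlib.metadata` from the standard library.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '3.1.0'."""
    return int(version_string.split(".")[0])


def rpy2_is_supported(version_string, minimum_major=3):
    """True if the given rpy2 version meets the assumed minimum major version."""
    return major_version(version_string) >= minimum_major


def check_installed_rpy2():
    """Return the installed rpy2 version if supported, None if rpy2 is absent."""
    try:
        installed = metadata.version("rpy2")
    except metadata.PackageNotFoundError:
        return None  # rpy2 not installed; the caller decides what to do
    if not rpy2_is_supported(installed):
        raise RuntimeError(
            "rpy2 %s found, but this code assumes rpy2 3 or later" % installed
        )
    return installed
```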

Acribbs commented 2 years ago

@IanSudbery sorry been on holiday for a few weeks so I missed all the conversations. I think for the moment we accept your changes and work out a way going forward to remove rpy dep in favour of subprocess calls with Rscripts. @jscaber made quite good progress removing a lot of rpy2 code, but I can see that some still remains.

jscaber commented 2 years ago

Hi this looks great, I have tried to run it on our local cluster, but the main data directory is still not repaired here, need a bit more time to do that, hopefully will receive a directory that has permissions to submit to the cluster later this week.

jscaber commented 2 years ago

@IanSudbery - Have finally tested this, sorry for the delay. Thanks for all the work, it works. We should unpin rpy2. I also had to install r-rjson via conda. <-- You have already done this!

The only problem I encountered in the code is having to increase the mem for the salmon index task from 8G to 20G. Official mem use according to slurm output is <1GB, but there is a write step of >10GB at the end which seems to fail with a segmentation fault otherwise. This might be cluster-specific. It is currently hard coded.

Other notes - the generation of subsets takes forever.