CostaLab / reg-gen

Regulatory Genomics Toolbox: Python library and set of tools for the integrative analysis of high throughput regulatory genomics data.
https://reg-gen.readthedocs.io/

rgt-hint differential --- [Errno 101] Network is unreachable #121

Closed PFRoux closed 5 years ago

PFRoux commented 5 years ago

Dear RGT dev team,

After using PIQ and Wellington on ATAC-seq data for years, I am enjoying the switch to RGT, since it outperforms those methods and is far easier to use.

After some successful tests on my laptop (MacBook Pro, 2.7 GHz Intel Core i7, 16 GB LPDDR3), I decided to install RGT on our local CentOS computing cluster here at Institut Pasteur. I have some heavy analyses to perform, comparing dozens of conditions to each other.

While the first steps (rgt-hint footprinting and rgt-motifanalysis matching) went perfectly fine, the heaviest one (i.e. rgt-hint differential) failed. After roughly 12 hours of running, the "Lineplots" output folder started to fill with .pwm and .txt files, but then the run stopped with the error below.

sbatch --mem-per-cpu 128000 --gres=disk:128000 --wrap="rgt-hint differential --organism=hg19 --bc --nc 5 --mpbs-file1=./match/D0ATAC_MERGE_123_DEDUP_NOBLACKLIST_DOWNSAMPLED_mpbs.bed --mpbs-file2=./match/D1ATAC_MERGE_123_DEDUP_NOBLACKLIST_DOWNSAMPLED_mpbs.bed --reads-file1=./Data/1-NOBLACKLIST_Bam/D0ATAC_MERGE_123_DEDUP_NOBLACKLIST_DOWNSAMPLED.bam --reads-file2=./Data/1-NOBLACKLIST_Bam/D1ATAC_MERGE_123_DEDUP_NOBLACKLIST_DOWNSAMPLED.bam --condition1=WI38_RAS_D0 --condition2=WI38_RAS_D1 --output-location=./Data/5-FOOTPRINTING/WI38_D0_vs_D1"

Traceback (most recent call last):
  File "/pasteur/homes/piroux/.local/bin/rgt-hint", line 11, in <module>
    load_entry_point('RGT==0.12.1', 'console_scripts', 'rgt-hint')()
  File "build/bdist.linux-x86_64/egg/rgt/HINT/Main.py", line 90, in main
  File "build/bdist.linux-x86_64/egg/rgt/HINT/DifferentialAnalysis.py", line 311, in diff_analysis_run
  File "/pasteur/homes/piroux/miniconda2/lib/python2.7/multiprocessing/pool.py", line 253, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/pasteur/homes/piroux/miniconda2/lib/python2.7/multiprocessing/pool.py", line 572, in get
    raise self._value
urllib2.URLError: <urlopen error [Errno 101] Network is unreachable>

The error message is pretty clear: the tool is trying to download something. The problem is that, for security reasons, the compute nodes on our cluster have no access to the Internet.

1) Is there a workaround? Maybe there is something to install first, before launching the run on a compute node? When I installed the tool (I struggled a bit with some dependencies), everything seemed fine in the end, so I don't think I am missing anything.

2) Is there a way to save the temporary files generated by rgt-hint differential, to avoid repeating the 12 hours of computation that precede writing the corrected footprints when something fails afterwards? I guess that, during these 12 hours, the tool parses the .bam files and computes the corrected Tn5 cut matrices for each PWM and each hit independently. It would be super nice to have an option that gives access to those matrices.

Thanks a lot for this awesome tool and for your help and advice.

Cheers !

Pef

lzj1769 commented 5 years ago

Hi Pef,

The error message is pretty clear: the tool is trying to download something. The problem is that, for security reasons, the compute nodes on our cluster have no access to the Internet.

You are right. During the generation of line plots, RGT needs to connect to weblogo3 to create the motif logos.

1) Is there a workaround? Maybe there is something to install first, before launching the run on a compute node?

I think the only way to do this without accessing the Internet is to make weblogo3 work locally. I will add it to our TODO list.

2) Is there a way to save the temporary files generated by rgt-hint differential, to avoid repeating the 12 hours of computation that precede writing the corrected footprints when something fails afterwards?

Actually, you can use the multi-processor option of rgt-hint differential by setting --nc to a number greater than 1. By default, it runs on a single CPU. Type rgt-hint differential --help for more information.
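
Until then, a quick preflight check on a compute node can save a long run from failing late. The following is a minimal sketch (Python 3, standard library only) that tests whether a TCP connection to a given host can be opened. The host and port are placeholders, since this thread does not state which server rgt-hint contacts.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within timeout."""
    try:
        # create_connection raises OSError on failure, which covers
        # "[Errno 101] Network is unreachable" as well as timeouts.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical usage, with a placeholder host and port:
#     if not can_reach("example.org", 80):
#         print("No outbound network access from this node")
```

Running this inside a short test job on the cluster shows whether the node can open outbound connections before a 12-hour job is submitted.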

Best, Li

PFRoux commented 5 years ago

Hi Li,

Thanks a lot for your kind answer.

Until you find time to solve this "problem", is there an easy workaround to run the differential analysis without outputting the PDF? By removing some lines somewhere in the code, maybe?

Thanks a lot again!

Have a nice day.

Pef

lzj1769 commented 5 years ago

Hi Pef,

By removing some lines somewhere in the code, maybe?

Yes, it's quite simple. Please comment out lines 304 to 311 in DifferentialAnalysis.py. Afterwards, you will still have the .pwm and .txt files in the Lineplots folder, which can be used to create the line plots manually.

Best, Li

PFRoux commented 5 years ago

Thanks a lot !

Just a naive question: do I have to make this change in the source code and then re-compile? Sorry, I do not code in Python at all...

Best,

Pef

lzj1769 commented 5 years ago

Hi,

Just a naive question: do I have to make this change in the source code and then re-compile?

Yes, you can do it as follows. First, uninstall RGT and clone the repo to your machine:

pip uninstall RGT
git clone git@github.com:CostaLab/reg-gen.git

Next, comment out lines 304 to 311 in reg-gen/rgt/HINT/DifferentialAnalysis.py.

Finally, go to the reg-gen directory and re-install RGT:

pip install ./ --user

Then everything should be fine.
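
For readers who prefer not to edit the file by hand, the comment-out step can be sketched as a small standalone helper (Python 3). The function name `comment_out` is made up for this illustration, and the line numbers 304 to 311 only apply to the RGT version discussed in this thread, so check them against your own checkout first.

```python
def comment_out(lines, first, last):
    """Prefix lines first..last (1-indexed, inclusive) with '# '."""
    out = []
    for number, line in enumerate(lines, start=1):
        if first <= number <= last and line.strip():
            # Keep the original indentation so the file stays readable.
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + "# " + line.lstrip())
        else:
            # Lines outside the range and blank lines are left untouched.
            out.append(line)
    return out


# Hypothetical usage on a local clone:
#     path = "reg-gen/rgt/HINT/DifferentialAnalysis.py"
#     with open(path) as handle:
#         edited = comment_out(handle.readlines(), 304, 311)
#     with open(path, "w") as handle:
#         handle.writelines(edited)
```

If the range covers the whole body of an `if`, `for` or `def`, the result is no longer valid Python, so it is worth opening the file afterwards to check.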

Best, Li

PFRoux commented 5 years ago

Dear Li,

Sorry to bother you again. Applying your suggestions, I managed to run the rgt-hint differential analysis without any error and got the _statistics.txt and _statistics.pdf files. Nevertheless, I didn't get the .txt and .pwm files per TF. Is there another trick that would allow me to get them?

All the best !

Pef

lzj1769 commented 5 years ago

Hi Pef,

Sorry to bother you again.

Not a problem, we are always happy to provide support for RGT.

Applying your suggestions, I managed to run the rgt-hint differential analysis without any error and got the _statistics.txt and _statistics.pdf files. Nevertheless, I didn't get the .txt and .pwm files per TF. Is there another trick that would allow me to get them?

Now I see. The .txt and .pwm files are written by the function line_plot.
So let's modify DifferentialAnalysis.py again as follows:

First, uncomment lines 304 to 311, as we are going to use the function line_plot for output. Then comment out lines 543 to 596, so that only the signal and PWM are written. Finally, re-install RGT:

pip uninstall RGT
pip install ./ --user

Let me know if you have any questions

Cheers, Li

PFRoux commented 5 years ago

Dear Li,

I would like to warmly thank you for your advice, which was fruitful. I managed to run an end-to-end analysis without any problem.

I would like to ask you some additional questions:

1) Does subsampling the .bam files beforehand, so that the two samples to be compared have the same number of alignments, make a difference in the rgt-hint differential analysis?

2) I am working on time courses and, instead of comparing sample pairs, I would rather look at the evolution of TF activities. Is there an easy way to achieve this? I looked at the _statistics.txt files output by rgt-hint differential, but it's not clear to me what the "TC" field is. Furthermore, I noticed that, across 2 differential analyses with the same condition1, the values given for TC_Condition1 and Protection_Conditionc1 differ between the 2 runs, which is likely the result of the normalization. Is there a way to deal with this when comparing multiple conditions?

3) After running a differential analysis between 2 conditions, I managed to get a shortlist of "differentially active TFs". When inspecting the footprints for some of those, I noticed that some are quite suspicious (see the pictures below) while others are really convincing. What could explain such weak footprints being called (the suspicious profile belongs to a TF with around 1000 binding sites)? And is there a way to filter them?

[two attached images of footprint profiles, not reproduced here]

Many thanks again for your help and advice.

Cheers,

Pef

lzj1769 commented 5 years ago

Hi,

Does subsampling the .bam files beforehand, so that the two samples to be compared have the same number of alignments, make a difference in the rgt-hint differential analysis?

I don't think so, because during the differential analysis the libraries are normalized to reduce the impact of library depth.

I am working on time courses and, instead of comparing sample pairs, I would rather look at the evolution of TF activities. Is there an easy way to achieve this?

For time-course data, I recommend comparing each pair of consecutive stages. For example, if you have samples from D1, D5, D10 and D15, you can run the differential analysis for D1-D5, D5-D10 and D10-D15, then collect the significantly different TFs and make a heatmap for visualization.
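
The pairing scheme above can be sketched in a few lines of Python 3. The flags are the ones used in the sbatch command at the top of this thread; the directory layout (match/, bam/, out/) and the helper names are assumptions made up for this example.

```python
import shlex


def consecutive_pairs(stages):
    """Pair each stage with the next one: [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(stages, stages[1:]))


def differential_command(cond1, cond2, organism="hg19", cores=5):
    """Build one rgt-hint differential call for a pair of conditions.

    The file locations below are hypothetical; adapt them to your data.
    """
    args = [
        "rgt-hint", "differential",
        "--organism=" + organism, "--bc", "--nc", str(cores),
        "--mpbs-file1=match/%s_mpbs.bed" % cond1,
        "--mpbs-file2=match/%s_mpbs.bed" % cond2,
        "--reads-file1=bam/%s.bam" % cond1,
        "--reads-file2=bam/%s.bam" % cond2,
        "--condition1=" + cond1,
        "--condition2=" + cond2,
        "--output-location=out/%s_vs_%s" % (cond1, cond2),
    ]
    # Quote each argument so the line can be pasted into a shell safely.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    for first, second in consecutive_pairs(["D1", "D5", "D10", "D15"]):
        print(differential_command(first, second))
```

Each printed line can be wrapped in a cluster submission command so that the three comparisons run as independent jobs.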

I looked at the _statistics.txt files output by rgt-hint differential, but it's not clear to me what the "TC" field is.

TC means tag count, which is the number of reads around a predicted binding site and can be used as an indicator of chromatin accessibility.

Furthermore, I noticed that, across 2 differential analyses with the same condition1, the values given for TC_Condition1 and Protection_Conditionc1 differ between the 2 runs, which is likely the result of the normalization. Is there a way to deal with this when comparing multiple conditions?

You are right, the differences come from the normalization. For multiple conditions, a workaround is to compare one condition vs. the others, so that you can find condition-specific TFs.

After running a differential analysis between 2 conditions, I managed to get a shortlist of "differentially active TFs". When inspecting the footprints for some of those, I noticed that some are quite suspicious (see the pictures below) while others are really convincing. What could explain such weak footprints being called (the suspicious profile belongs to a TF with around 1000 binding sites)? And is there a way to filter them?

Weak footprints are caused either by too few binding sites or by factors with a very short residence time; see the discussion of TF residence time in https://www.nature.com/articles/nmeth.3772.pdf.

For filtering, there is a column in _statistics.txt called Num, which can be used to filter out factors with a low number of binding sites. In addition, Protection_Score can be used to identify TFs with a short residence time. A negative protection score usually means that the factor has a very short binding time and thus leaves an unclear footprint.
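
As a rough illustration of this filtering, here is a minimal Python 3 sketch. It assumes that _statistics.txt is tab-delimited with a header row, and it takes the column names as parameters because the exact protection-score column names are not given in this thread. The thresholds and the demo table are made up.

```python
import csv
import io


def filter_factors(handle, num_col, score_cols, min_sites, min_score=0.0):
    """Keep factors with enough binding sites and a non-negative protection
    score in at least one of the given columns."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        enough_sites = int(row[num_col]) >= min_sites
        # 'any' keeps a factor that is protected in at least one condition;
        # switch to 'all' for a stricter filter.
        protected = any(float(row[col]) >= min_score for col in score_cols)
        if enough_sites and protected:
            kept.append(row)
    return kept


if __name__ == "__main__":
    # Made-up table with hypothetical column names, for illustration only.
    demo = (
        "Motif\tNum\tScore_A\tScore_B\n"
        "TF1\t5000\t1.2\t0.4\n"
        "TF2\t300\t2.0\t1.5\n"
        "TF3\t8000\t-0.5\t-0.1\n"
    )
    rows = filter_factors(io.StringIO(demo), "Num", ["Score_A", "Score_B"],
                          min_sites=1000)
    print([row["Motif"] for row in rows])
```

In the demo, TF2 is dropped for having too few sites and TF3 for having negative scores in both conditions. Which cutoff on Num is sensible depends on the data set.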

Best, Li

fabio-t commented 5 years ago

@lzj1769 can this be closed?