biocore-ntnu / epic

(DEPRECATED) epic: diffuse domain ChIP-Seq caller based on SICER
http://bioepic.readthedocs.io
MIT License
31 stars 6 forks source link

Epic effective for #87

Open parida007 opened 5 years ago

parida007 commented 5 years ago

I want to run the epic-effective for human genome hg38. I am using a 16 gb ram for it but after using the command it is showing memory error. using the following command: epic-effective -r 36 GRCh38.p10.genome.fa getting the following error: File analyzed: GRCh38.p10.genome.fa (File: effective_genome_size, Log level: INFO, Time: Thu, 06 Sep 2018 17:28:46 ) Genome length: 3236815040 (File: effective_genome_size, Log level: INFO, Time: Thu, 06 Sep 2018 17:28:46 ) File analyzed: GRCh38.p10.genome.fa Genome length: 3236815040 terminate called after throwing an instance of 'jellyfish::large_hash::array_base<jellyfish::mer_dna_ns::mer_base_static<unsigned long, 0>, unsigned long, atomic::gcc, jellyfish::large_hash::unbounded_array<jellyfish::mer_dna_ns::mer_base_static<unsigned long, 0>, unsigned long, atomic::gcc, allocators::mmap> >::ErrorAllocation' what(): Failed to allocate 20443042440 bytes of memory Aborted (core dumped) Failed to open input file '/tmp/GRCh38.p10.genome.fa.jf' Traceback (most recent call last): File "/usr/local/bin/epic-effective", line 38, in effective_genome_size(fasta, read_length, nb_cpu, tmpdir) File "/usr/local/lib/python2.7/dist-packages/epic/scripts/effective_genome_size.py", line 56, in effective_genome_size shell=True) File "/usr/lib/python2.7/subprocess.py", line 574, in check_output raise CalledProcessError(retcode, cmd, output=output) subprocess.CalledProcessError: Command 'jellyfish stats /tmp/GRCh38.p10.genome.fa.jf' returned non-zero exit status 1 rm: cannot remove '/tmp/GRCh38.p10.genome.fa.jf': No such file or directory

But when I use a single chromosome it works fine. epic-effective -r 36 gencode_chrY.fa Temporary directory: /tmp/ (File: effective_genome_size, Log level: INFO, Time: Fri, 07 Sep 2018 11:00:26 ) File analyzed: gencode_chrY.fa (File: effective_genome_size, Log level: INFO, Time: Fri, 07 Sep 2018 11:00:26 ) Genome length: 57227415 (File: effective_genome_size, Log level: INFO, Time: Fri, 07 Sep 2018 11:00:26 ) File analyzed: gencode_chrY.fa Genome length: 57227415 Number unique 36-mers: 21608143 (File: effective_genome_size, Log level: INFO, Time: Fri, 07 Sep 2018 11:00:39 ) Effective genome size: 0.377583768199 (File: effective_genome_size, Log level: INFO, Time: Fri, 07 Sep 2018 11:00:39 ) Number unique 36-mers: 21608143 Effective genome size: 0.377583768199

So how can overcome this issue for effective genome size calculation for entire genome.

endrebak commented 5 years ago

I would rather recommend that you do not compute epic-effective for hg38, it is included in epic. While the version included is calculated for UCSC, it is not going to make a difference :)

Thanks for reporting :) Why do you want to calculate the value?

parida007 commented 5 years ago

I want to calculate the EFFECTIVE_GENOME_FRACTION for hg38 genome. As it is mentioned in the command -"Use a different effective genome fraction than the one included in epic. The default value depends on the genome and readlength, but is a number between 0 and 1." The value will change based on the read length. So for a particular read length I want to compute the effective genome fraction.

endrebak commented 5 years ago

I can guarantee you that it will not matter much for epic at least. The difference between 36 and 100 bp for hg38 is < 10% which does not make a difference to the statistics. And I cannot help you run it unfortunately. Do you have a cluster/larger machine to run this on? The problem is that the another piece of software (jellyfish2) uses too much memory.

On Fri, Sep 7, 2018 at 2:39 PM parida007 notifications@github.com wrote:

I want to calculate the EFFECTIVE_GENOME_FRACTION for hg38 genome. As it is mentioned in the command -"Use a different effective genome fraction than the one included in epic. The default value depends on the genome and readlength, but is a number between 0 and 1." The value will change based on the read length. So for a particular read length I want to compute the effective genome fraction.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/biocore-ntnu/epic/issues/87#issuecomment-419426542, or mute the thread https://github.com/notifications/unsubscribe-auth/AQ9I0r_wb62nHVtxJliQY72YiRMQUOk3ks5uYmj6gaJpZM4WeM-J .

parida007 commented 5 years ago

Thanks for your information. No I don't have a large cluster to run it. I have run the command with default parameter setting and getting the enrichment result. Can you please help me to understand what does the following means: 2553959578.0 effective_genome_fraction (File: compute_background_probabilites, Log level: DEBUG, Time: Mon, 10 Sep 2018 17:12:44 ) What I have understood is effective genome fraction must be in between 0 and 1. Though initially it shows something like: Used first 10000 reads of SRR524941.bam.bed to estimate a median read length of 36.0 Mean readlength: 36.0004, max readlength: 40, min readlength: 32. (File: find_readlength, Log level: INFO, Time: Mon, 10 Sep 2018 16:54:29 ) Using an effective genome fraction of 0.8269827491300733. (File: genomes, Log level: INFO, Time: Mon, 10 Sep 2018 16:54:29 )

endrebak commented 5 years ago

It is just output of the software. It says your readlength is 36 and therefore it chooses a certain egf.

:)

On Tue, Sep 11, 2018 at 7:39 AM parida007 notifications@github.com wrote:

Thanks for your information. No I don't have a large cluster to run it. I have run the command with default parameter setting and getting the enrichment result. Can you please help me to understand what does the following means: 2553959578.0 effective_genome_fraction (File: compute_background_probabilites, Log level: DEBUG, Time: Mon, 10 Sep 2018 17:12:44 ) What I have understood is effective genome fraction must be in between 0 and 1. Though initially it shows something like: Used first 10000 reads of SRR524941.bam.bed to estimate a median read length of 36.0 Mean readlength: 36.0004, max readlength: 40, min readlength: 32. (File: find_readlength, Log level: INFO, Time: Mon, 10 Sep 2018 16:54:29 ) Using an effective genome fraction of 0.8269827491300733. (File: genomes, Log level: INFO, Time: Mon, 10 Sep 2018 16:54:29 )

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/biocore-ntnu/epic/issues/87#issuecomment-420153076, or mute the thread https://github.com/notifications/unsubscribe-auth/AQ9I0uAk7dZKU3N0hH-7Ij526SS7-GVSks5uZ0yQgaJpZM4WeM-J .

parida007 commented 5 years ago

That I understood.. but the thing is that can the egf be 2553959578.0 ??

endrebak commented 5 years ago

No, that is the effective genome size. Good catch. Should update that :)

On Tue, Sep 11, 2018 at 10:11 AM parida007 notifications@github.com wrote:

That I understood.. but the thing is that can the egf be 2553959578.0 ??

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/biocore-ntnu/epic/issues/87#issuecomment-420186784, or mute the thread https://github.com/notifications/unsubscribe-auth/AQ9I0on3UNYX6iZ0sGNzgh7xNrftWcwEks5uZ3BJgaJpZM4WeM-J .

parida007 commented 5 years ago

So it should be 0.8269827491300733... If I am not wrong..

endrebak commented 5 years ago

Yes, but the egs is egf * genome size. The code is correct, the logging info not.

On Tue, Sep 11, 2018 at 10:53 AM parida007 notifications@github.com wrote:

So it should be 0.8269827491300733... If I am not wrong..

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/biocore-ntnu/epic/issues/87#issuecomment-420198803, or mute the thread https://github.com/notifications/unsubscribe-auth/AQ9I0lNvY684ig89TcR8CqWzk94uLHVcks5uZ3oXgaJpZM4WeM-J .

parida007 commented 5 years ago

epic -t SRR524944.bed SRR524945.bed SRR524946.bed -c ../GSM945859-input/SRR504936.bed ../GSM945859-input/SRR504937.bed --genome hg38 --false-discovery-rate-cutoff 0.01 -o enriched_regions_GSM970218.csv -bw bigwigs -sm matrix.gz # epic_version: 0.2.12, pandas_version: 0.23.4 (File: epic, Log level: INFO, Time: Tue, 11 Sep 2018 15:01:20 )

cat: write error: Broken pipe

Can you tell me whether this error is having an impact with the final output matrix

endrebak commented 5 years ago

No, it is not :)

On Tue, Sep 11, 2018 at 11:29 AM parida007 notifications@github.com wrote:

epic -t SRR524944.bed SRR524945.bed SRR524946.bed -c ../GSM945859-input/SRR504936.bed ../GSM945859-input/SRR504937.bed --genome hg38 --false-discovery-rate-cutoff 0.01 -o enriched_regions_GSM970218.csv -bw bigwigs -sm matrix.gz # epic_version: 0.2.12, pandas_version: 0.23.4 (File: epic, Log level: INFO, Time: Tue, 11 Sep 2018 15:01:20 )

cat: write error: Broken pipe

Can you tell me whether this error is having an impact with the final output matrix

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/biocore-ntnu/epic/issues/87#issuecomment-420209415, or mute the thread https://github.com/notifications/unsubscribe-auth/AQ9I0ssoBafu4G872Zd1VxHyNTeWDIRYks5uZ4KMgaJpZM4WeM-J .