Open parida007 opened 5 years ago
I would rather recommend that you do not compute epic-effective for hg38, it is included in epic. While the version included is calculated for UCSC, it is not going to make a difference :)
Thanks for reporting :) Why do you want to calculate the value?
I want to calculate the EFFECTIVE_GENOME_FRACTION for hg38 genome. As it is mentioned in the command -"Use a different effective genome fraction than the one included in epic. The default value depends on the genome and readlength, but is a number between 0 and 1." The value will change based on the read length. So for a particular read length I want to compute the effective genome fraction.
I can guarantee you that it will not matter much for epic at least. The difference between 36 and 100 bp for hg38 is < 10% which does not make a difference to the statistics. And I cannot help you run it unfortunately. Do you have a cluster/larger machine to run this on? The problem is that the another piece of software (jellyfish2) uses too much memory.
On Fri, Sep 7, 2018 at 2:39 PM parida007 notifications@github.com wrote:
I want to calculate the EFFECTIVE_GENOME_FRACTION for hg38 genome. As it is mentioned in the command -"Use a different effective genome fraction than the one included in epic. The default value depends on the genome and readlength, but is a number between 0 and 1." The value will change based on the read length. So for a particular read length I want to compute the effective genome fraction.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/biocore-ntnu/epic/issues/87#issuecomment-419426542, or mute the thread https://github.com/notifications/unsubscribe-auth/AQ9I0r_wb62nHVtxJliQY72YiRMQUOk3ks5uYmj6gaJpZM4WeM-J .
Thanks for your information. No I don't have a large cluster to run it. I have run the command with default parameter setting and getting the enrichment result. Can you please help me to understand what does the following means: 2553959578.0 effective_genome_fraction (File: compute_background_probabilites, Log level: DEBUG, Time: Mon, 10 Sep 2018 17:12:44 ) What I have understood is effective genome fraction must be in between 0 and 1. Though initially it shows something like: Used first 10000 reads of SRR524941.bam.bed to estimate a median read length of 36.0 Mean readlength: 36.0004, max readlength: 40, min readlength: 32. (File: find_readlength, Log level: INFO, Time: Mon, 10 Sep 2018 16:54:29 ) Using an effective genome fraction of 0.8269827491300733. (File: genomes, Log level: INFO, Time: Mon, 10 Sep 2018 16:54:29 )
It is just output of the software. It says your readlength is 36 and therefore it chooses a certain egf.
:)
On Tue, Sep 11, 2018 at 7:39 AM parida007 notifications@github.com wrote:
Thanks for your information. No I don't have a large cluster to run it. I have run the command with default parameter setting and getting the enrichment result. Can you please help me to understand what does the following means: 2553959578.0 effective_genome_fraction (File: compute_background_probabilites, Log level: DEBUG, Time: Mon, 10 Sep 2018 17:12:44 ) What I have understood is effective genome fraction must be in between 0 and 1. Though initially it shows something like: Used first 10000 reads of SRR524941.bam.bed to estimate a median read length of 36.0 Mean readlength: 36.0004, max readlength: 40, min readlength: 32. (File: find_readlength, Log level: INFO, Time: Mon, 10 Sep 2018 16:54:29 ) Using an effective genome fraction of 0.8269827491300733. (File: genomes, Log level: INFO, Time: Mon, 10 Sep 2018 16:54:29 )
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/biocore-ntnu/epic/issues/87#issuecomment-420153076, or mute the thread https://github.com/notifications/unsubscribe-auth/AQ9I0uAk7dZKU3N0hH-7Ij526SS7-GVSks5uZ0yQgaJpZM4WeM-J .
That I understood.. but the thing is that can the egf be 2553959578.0 ??
No, that is the effective genome size. Good catch. Should update that :)
On Tue, Sep 11, 2018 at 10:11 AM parida007 notifications@github.com wrote:
That I understood.. but the thing is that can the egf be 2553959578.0 ??
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/biocore-ntnu/epic/issues/87#issuecomment-420186784, or mute the thread https://github.com/notifications/unsubscribe-auth/AQ9I0on3UNYX6iZ0sGNzgh7xNrftWcwEks5uZ3BJgaJpZM4WeM-J .
So it should be 0.8269827491300733... If I am not wrong..
Yes, but the egs is egf * genome size. The code is correct, the logging info not.
On Tue, Sep 11, 2018 at 10:53 AM parida007 notifications@github.com wrote:
So it should be 0.8269827491300733... If I am not wrong..
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/biocore-ntnu/epic/issues/87#issuecomment-420198803, or mute the thread https://github.com/notifications/unsubscribe-auth/AQ9I0lNvY684ig89TcR8CqWzk94uLHVcks5uZ3oXgaJpZM4WeM-J .
cat: write error: Broken pipe
Can you tell me whether this error is having an impact with the final output matrix
No, it is not :)
On Tue, Sep 11, 2018 at 11:29 AM parida007 notifications@github.com wrote:
epic -t SRR524944.bed SRR524945.bed SRR524946.bed -c ../GSM945859-input/SRR504936.bed ../GSM945859-input/SRR504937.bed --genome hg38 --false-discovery-rate-cutoff 0.01 -o enriched_regions_GSM970218.csv -bw bigwigs -sm matrix.gz # epic_version: 0.2.12, pandas_version: 0.23.4 (File: epic, Log level: INFO, Time: Tue, 11 Sep 2018 15:01:20 )
cat: write error: Broken pipe
Can you tell me whether this error is having an impact with the final output matrix
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/biocore-ntnu/epic/issues/87#issuecomment-420209415, or mute the thread https://github.com/notifications/unsubscribe-auth/AQ9I0ssoBafu4G872Zd1VxHyNTeWDIRYks5uZ4KMgaJpZM4WeM-J .
I want to run the epic-effective for human genome hg38. I am using a 16 gb ram for it but after using the command it is showing memory error. using the following command: epic-effective -r 36 GRCh38.p10.genome.fa getting the following error: File analyzed: GRCh38.p10.genome.fa (File: effective_genome_size, Log level: INFO, Time: Thu, 06 Sep 2018 17:28:46 ) Genome length: 3236815040 (File: effective_genome_size, Log level: INFO, Time: Thu, 06 Sep 2018 17:28:46 ) File analyzed: GRCh38.p10.genome.fa Genome length: 3236815040 terminate called after throwing an instance of 'jellyfish::large_hash::array_base<jellyfish::mer_dna_ns::mer_base_static<unsigned long, 0>, unsigned long, atomic::gcc, jellyfish::large_hash::unbounded_array<jellyfish::mer_dna_ns::mer_base_static<unsigned long, 0>, unsigned long, atomic::gcc, allocators::mmap> >::ErrorAllocation' what(): Failed to allocate 20443042440 bytes of memory Aborted (core dumped) Failed to open input file '/tmp/GRCh38.p10.genome.fa.jf' Traceback (most recent call last): File "/usr/local/bin/epic-effective", line 38, in
effective_genome_size(fasta, read_length, nb_cpu, tmpdir)
File "/usr/local/lib/python2.7/dist-packages/epic/scripts/effective_genome_size.py", line 56, in effective_genome_size
shell=True)
File "/usr/lib/python2.7/subprocess.py", line 574, in check_output
raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command 'jellyfish stats /tmp/GRCh38.p10.genome.fa.jf' returned non-zero exit status 1
rm: cannot remove '/tmp/GRCh38.p10.genome.fa.jf': No such file or directory
But when I use a single chromosome it works fine. epic-effective -r 36 gencode_chrY.fa Temporary directory: /tmp/ (File: effective_genome_size, Log level: INFO, Time: Fri, 07 Sep 2018 11:00:26 ) File analyzed: gencode_chrY.fa (File: effective_genome_size, Log level: INFO, Time: Fri, 07 Sep 2018 11:00:26 ) Genome length: 57227415 (File: effective_genome_size, Log level: INFO, Time: Fri, 07 Sep 2018 11:00:26 ) File analyzed: gencode_chrY.fa Genome length: 57227415 Number unique 36-mers: 21608143 (File: effective_genome_size, Log level: INFO, Time: Fri, 07 Sep 2018 11:00:39 ) Effective genome size: 0.377583768199 (File: effective_genome_size, Log level: INFO, Time: Fri, 07 Sep 2018 11:00:39 ) Number unique 36-mers: 21608143 Effective genome size: 0.377583768199
So how can overcome this issue for effective genome size calculation for entire genome.