TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

Relaunching divergence calculation #113

Closed SwiftSeal closed 5 months ago

SwiftSeal commented 5 months ago

Hi @TobyBaril

I'm experiencing issues with the repeat landscape figure not being generated; it looks like it's due to the divergence calculation script? SLURM is also reporting OOM issues (192 GB peak memory on a 600 Mb genome), not sure if that could be related? I'm fairly certain this was running correctly on the same genome a few months ago, but I have updated since then. Is there a recommended method for rerunning this step of the pipeline specifically? Otherwise the pipeline is finishing correctly and the outputs look sensible.

Thanks in advance!

Starting calculations
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/mnt/shared/scratch/msmith/apps/conda/envs/earlgrey/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/mnt/shared/scratch/msmith/apps/conda/envs/earlgrey/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/mnt/shared/scratch/msmith/apps/conda/envs/earlgrey/share/earlgrey-4.2.4-0/scripts//divergenceCalc/divergence_calc.py", line 138, in outer_func
    aln = AlignIO.read(query_path+".water", 'fasta')
  File "/mnt/shared/scratch/msmith/apps/conda/envs/earlgrey/lib/python3.9/site-packages/Bio/AlignIO/__init__.py", line 383, in read
    alignment = next(iterator)
  File "/mnt/shared/scratch/msmith/apps/conda/envs/earlgrey/lib/python3.9/site-packages/Bio/AlignIO/__init__.py", line 334, in parse
    yield from i
  File "/mnt/shared/scratch/msmith/apps/conda/envs/earlgrey/lib/python3.9/site-packages/Bio/AlignIO/__init__.py", line 276, in _SeqIO_to_alignment_iterator
    yield MultipleSeqAlignment(records)
  File "/mnt/shared/scratch/msmith/apps/conda/envs/earlgrey/lib/python3.9/site-packages/Bio/Align/__init__.py", line 186, in __init__
    self.extend(records)
  File "/mnt/shared/scratch/msmith/apps/conda/envs/earlgrey/lib/python3.9/site-packages/Bio/Align/__init__.py", line 494, in extend
    self._append(rec, expected_length)
  File "/mnt/shared/scratch/msmith/apps/conda/envs/earlgrey/lib/python3.9/site-packages/Bio/Align/__init__.py", line 556, in _append
    raise ValueError("Sequences must all be the same length")
ValueError: Sequences must all be the same length
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/shared/scratch/msmith/apps/conda/envs/earlgrey/share/earlgrey-4.2.4-0/scripts//divergenceCalc/divergence_calc.py", line 203, in <module>
    results = pool.map(func, chunks)
  File "/mnt/shared/scratch/msmith/apps/conda/envs/earlgrey/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/mnt/shared/scratch/msmith/apps/conda/envs/earlgrey/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
ValueError: Sequences must all be the same length
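For reference, Biopython's `AlignIO.read` raises this error whenever the records in a single alignment have different lengths. The constraint can be illustrated with a stdlib-only sketch (the parser and helper names here are illustrative, not EarlGrey's or Biopython's actual code):

```python
def read_fasta(text):
    """Parse FASTA text into (header, sequence) pairs."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        else:
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def check_alignment(records):
    """Mimic AlignIO's constraint: every row must span the same number of columns."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) > 1:
        raise ValueError("Sequences must all be the same length")
    return records

good = ">a\nACGT-\n>b\nAC-GT\n"   # both rows are 5 columns, including gaps
bad = ">a\nACGT\n>b\nAC\n"        # ragged rows: 4 vs 2 columns

check_alignment(read_fasta(good))
try:
    check_alignment(read_fasta(bad))
except ValueError as e:
    print(e)  # prints: Sequences must all be the same length
```

A well-formed pairwise alignment from EMBOSS `water` should always have equal-length rows (gaps are padded with `-`), which is why a truncated or malformed `.water` file is the usual suspect for this error.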
jamesdgalbraith commented 5 months ago

Hey @SwiftSeal. If you'd like to run the divergence plotting scripts separately so we can debug them, the scripts we incorporated can be found here: https://github.com/jamesdgalbraith/EarlGreyDivergenceCalc.

The Python and R packages used are all in the current EarlGrey conda environment, so if it's loaded it should be fine. If you'd rather use a fresh conda environment, I've listed all the required packages there.

In terms of the error, would you be able to share the .water files currently in the ${OUTDIR}/${species}_RepeatLandscape/tmp/ folder? The error appears to be due to Biopython's AlignIO trying to read alignments from EMBOSS water containing sequences of unequal length, which (normally) isn't possible for water to produce!

On the memory issue front, how many threads were you running EarlGrey with? If the issue is arising at this stage and you're using a lot of threads, this should be an easy fix on our end.
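The thread/memory link hinted at here follows from how a multiprocessing pool is typically fed: if the input is pre-split into one in-memory chunk per worker, all chunks are resident at once and peak memory grows roughly linearly with thread count. A toy sketch of that chunking pattern (not EarlGrey's actual code):

```python
def make_chunks(items, n_workers):
    """Round-robin split: one chunk per worker, all held in memory simultaneously.

    With large per-item payloads (e.g. sequence alignments), doubling
    n_workers roughly doubles how much input is materialised at once."""
    return [items[i::n_workers] for i in range(n_workers)]

items = list(range(10))
print([len(c) for c in make_chunks(items, 4)])  # prints [3, 3, 2, 2]
```

Passing a lazy iterable to `Pool.imap` with a modest `chunksize`, rather than pre-built lists to `Pool.map`, is one common way to decouple peak memory from worker count.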

SwiftSeal commented 5 months ago

@jamesdgalbraith

Thanks for the quick response :)

I'm running that just now - will let you know how it goes! I was using 32 threads initially; I've scaled this back to 16 and given it 2 TB of memory for the hell of it, as it didn't seem to be impacting performance too much. For some reason it is now warning:

WARNING. chromosome (chr05) was not found in the FASTA file. Skipping.
WARNING. chromosome (chr12) was not found in the FASTA file. Skipping.

I've attached all the .water files under that directory for the previous run below:

qseqs.zip

jamesdgalbraith commented 5 months ago

@SwiftSeal Thanks for that.

There may have been a bug in the attempted bug fixes I uploaded to the other repo an hour ago. Fortunately, the latest commit fixes them and should be able to ignore faulty water alignments. There don't appear to be any faulty water alignments in the files you sent through, so I'm quite confused as to what caused the previous error!
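The "ignore faulty water alignments" behaviour described here can be sketched as a defensive wrapper around the per-alignment parse, so one malformed file no longer crashes the whole multiprocessing pool (function names are hypothetical, not the actual divergence_calc.py code):

```python
def parse_alignment(text):
    """Stand-in for AlignIO.read: raises ValueError on ragged rows."""
    rows = [l for l in text.splitlines() if l and not l.startswith(">")]
    if len({len(r) for r in rows}) > 1:
        raise ValueError("Sequences must all be the same length")
    return rows

def parse_or_skip(text):
    """Return the parsed alignment, or None for a faulty one.

    Skipping (rather than letting the exception propagate out of a
    pool worker) lets the remaining divergence calculations finish."""
    try:
        return parse_alignment(text)
    except ValueError:
        return None

alignments = [
    ">q\nAC-GT\n>t\nACGGT\n",  # well-formed: both rows 5 columns
    ">q\nACGT\n>t\nAC\n",      # faulty: ragged rows
]
results = [r for a in alignments if (r := parse_or_skip(a)) is not None]
print(len(results))  # prints 1
```

Logging the skipped file's path inside the `except` branch would make faulty inputs easy to track down afterwards.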

I believe the newest error seems to be due to pybedtools being unable to find the scaffolds called chr05 and chr12 in the genome file.
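A quick way to spot that kind of naming mismatch up front is to compare the scaffold names used in the intervals against the FASTA headers before handing anything to pybedtools. A stdlib-only sketch (helper names and inputs are illustrative):

```python
def fasta_names(fasta_text):
    """Scaffold names from FASTA headers: first whitespace-delimited token."""
    return {line[1:].split()[0]
            for line in fasta_text.splitlines() if line.startswith(">")}

def missing_chromosomes(bed_text, fasta_text):
    """Names referenced in BED intervals but absent from the FASTA."""
    used = {line.split("\t")[0] for line in bed_text.splitlines() if line.strip()}
    return sorted(used - fasta_names(fasta_text))

fasta = ">chr01 assembled\nACGT\n>chr02\nACGT\n"
bed = "chr01\t0\t4\tTE1\nchr05\t10\t20\tTE2\n"
print(missing_chromosomes(bed, fasta))  # prints ['chr05']
```

Mismatches like this often come from renamed headers (e.g. `chr05` vs `chr05_RagTag`) between the annotation and the genome FASTA used at this step.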

SwiftSeal commented 5 months ago

Great, that fix worked, thank you! Aye, I had a look at the water alignments but couldn't see anything wrong either... The latest run finished and still failed at that step, so it seems to be consistent for this genome, but all good otherwise.

TobyBaril commented 5 months ago

Hi @SwiftSeal! Thanks for pointing this out. I let James know and it seems like he has sorted it. I've added this patch for the next release, which will probably go live sometime today!