StuartJEBaird / diem

Diagnostic index expectation maximisation (diem) algorithm implementations
6 stars 0 forks source link

Running diem on whole genome #5

Open simonharnqvist opened 4 months ago

simonharnqvist commented 4 months ago

Hi @StuartJEBaird - taking you up on your offer on how to run diem on genome scale data. Raising an issue here so it becomes part of the documentation, but feel free to direct me elsewhere.

First things first - the nCores parameter in diemR does not seem to be working on macOS; CPU load is not responsive to the requested number of threads. I'm not an R user: is there's something else I need to do to enable R to multiprocess? (Given files are per chromosome I can work around this easily, but just thought I should let you know)

Other than multiprocessing, what are some tricks to make this run smoothly on a whole genome?

Thanks in advance Simon

StuartJEBaird commented 4 months ago

Hi Simon,

Regarding support: When it comes to R specifics, Natalia is our guru! (Not me)

If you are not an R user, then why not try our Python implementation?

It goes faster than R....and (of course) accepts the same input, does the same thing, gives the same output. If you have your diem parameter set up for R it should not be too much trouble to given them as arguments to the Python diem function.

Your data is large, so my tip would be: Divide it up into a multiple of nCoresAvailable ~equal-sized compartments.

This optimisation can be fiddly if you are including non-conventional ploidy Chromosomes...as each compartment has only one ploidy spec.

(A compartment ploidy spec is just a list of the ploidy of each individual in the input - this is a function of gender for W,X,Y,Z)

If your chr sizes are of the same order of magnitude, then it's probably easier to stick with one compartment per chr - it will just take a bit longer.

Keep the questions coming,

Best

S

On Fri, 26 Apr 2024 at 12:57, simonharnqvist @.***> wrote:

Hi @StuartJEBaird https://github.com/StuartJEBaird - taking you up on your offer on how to run diem on genome scale data. Raising an issue here so it becomes part of the documentation, but feel free to direct me elsewhere.

First things first - the nCores parameter in diemR does not seem to be working on macOS; CPU load is not responsive to the requested number of threads. I'm not an R user: is there's something else I need to do to enable R to multiprocess? (Given files are per chromosome I can work around this easily, but just thought I should let you know)

Other than multiprocessing, what are some tricks to make this run smoothly on a whole genome?

Thanks in advance Simon

— Reply to this email directly, view it on GitHub https://github.com/StuartJEBaird/diem/issues/5, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDCV54ETQQE635OCHWU55LY7IXJ5AVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43ASLTON2WKOZSGI3DKNJTGY4TIMI . You are receiving this because you were mentioned.Message ID: @.***>

simonharnqvist commented 4 months ago

Thanks @StuartJEBaird! I would definitely prefer using Python, but I couldn't find any documentation for the Python version

StuartJEBaird commented 4 months ago

You were able to download diempy?

The diem function is identical across all platforms, and so usage and the arguments should be the same...

ie if you have the arguments for diemr then they should be the arguments for diempy.

This identity is true for diempy/diemmca, but may not hold perfectly between diemr/diempy. (Natalia has been developing a wrapper round the core R function).

We can fill in this part of the support right now if you wish...

What are your arguments to the diemr version?

On my side, I will look out the python calls I used for the diem publication. (I use diemmca day to day).

Best

S

On Tue, 30 Apr 2024 at 12:21, simonharnqvist @.***> wrote:

Thanks @StuartJEBaird https://github.com/StuartJEBaird! I would definitely prefer using Python, but I couldn't find any documentation for the Python version

— Reply to this email directly, view it on GitHub https://github.com/StuartJEBaird/diem/issues/5#issuecomment-2084926345, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDCV5Z246L3GQJ6X6Z554DY75WELAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOBUHEZDMMZUGU . You are receiving this because you were mentioned.Message ID: @.***>

simonharnqvist commented 4 months ago

Hmm - the function signatures are very different:

diem.py

def diem(PhiW, CompartmentNames, CompartmentPloidies, datapath, outputPath, ChosenInds, diemMaxInterations, Epsilon, verbose, processes):

vs

diem.R

diem <- function(files, ploidy = list(2), markerPolarity = FALSE, ChosenInds,
                 epsilon = 0.99999, verbose = FALSE, nCores = parallel::detectCores() - 1,
                 maxIterations = 50, ...)
StuartJEBaird commented 4 months ago

Hi Simon!

The final 5 arguments are identical:

ChosenInds, diemMaxInterations (maxIterations), Epsilon, verbose, processes (nCores)

leaving:

PhiW, CompartmentNames, CompartmentPloidies, datapath, outputPath

CompartmentNames is a list of strings... datapath can end in "/", though this is optional as the internal code is:

CompartmentDatapaths = [ os.path.join(datapath, name) for name in CompartmentNames ]

outputPath is obvious.

The only non-obvs inputs are then 2: PhiW, CompartmentPloidies

PhiW is the Null Polarisation you wish to use. To generate a Null Polarisation for total N[sites] NS, use a random number generator to generate a list of NS "1" or "2" characters. Subdivide this list and concatenate to form N[compartments] strings "1211212211..." The string lengths of these strings are the numbers of sites in each compartment.

Here is some pseudocode: CompartmentLengths = Map[Length,Compartments] NS = Total[ CompartmentLengths ]

PhiW = RandomChoice[ {"1","2"}, NS ]

PhiW= TakeList[ PhiW, CompartmentLengths ]

PhiW = Map[StringJoin, PhiW]

Each time you generate a Null it is (very) likely to be unique, so it is good practice to save it to file before passing it to diem as an argument. (You can then guarantee replicability of any results, as diem is deterministic).

CompartmentPloidies details the ploidy for each individual, for each data compartment It is simplest to set all compartment ploidies the same (normally 2... see diemr default)... but you may need to be more exact, and divide your sites into sets of equal ploidy such as Autosome, W, X, Y, Z, mt

For diploid autosomes and mt all individuals have the same ploidy: {2,2,2,2,2...} and {1,1,1,1,1...} respectively. For data from males=M, females=F, {MFMFMFMFMFMFMF....} X and Z compartments have ploidy: {1,2,1,2,1...} and {2,1,2,1,2...} respectively. Y and W compartments have ploidy: {1,0,1,0,1...} and {0,1,0,1,0...} respectively.

You have to determine your own {MFMMFFF...} sequence from the metadata of your genomes (as ordered in your diem input sites).

This makes CompartmentPloidies is an N[compartments] x N[individuals] matrix of whole numbers.

Is this sufficient info for you to get diempy running?

Checking my diem Manual: it is very much a work in progress, so would not yet be useful for you, but if the above is sufficient I will use it as a guide to the relevant Manual section.

Bestest

Stuart

On Tue, 30 Apr 2024 at 15:46, simonharnqvist @.***> wrote:

Hmm - the function signatures are very different:

diem.py

def diem(PhiW, CompartmentNames, CompartmentPloidies, datapath, outputPath, ChosenInds, diemMaxInterations, Epsilon, verbose, processes):

vs

diem.R

diem <- function(files, ploidy = list(2), markerPolarity = FALSE, ChosenInds, epsilon = 0.99999, verbose = FALSE, nCores = parallel::detectCores() - 1, maxIterations = 50, ...)

— Reply to this email directly, view it on GitHub https://github.com/StuartJEBaird/diem/issues/5#issuecomment-2085378799, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDCV56NFQB4R4QOBEZCJE3Y76OETAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOBVGM3TQNZZHE . You are receiving this because you were mentioned.Message ID: @.***>

simonharnqvist commented 4 months ago

Hi Stuart,

I was being very silly - of course it doesn't use more than one core when I've only provided one input file. As far as I can tell, it works on MacOS - will run a proper test overnight.

With diempy, I'm getting an error - I'll stick the traceback below for your information, but I'm happy enough with diemR now.

chr14_len = int(subprocess.check_output(["wc", "-l", "diem_files/diem_input/brenthis_ino.SP_BI_364.chromosome_14.diem.txt"]).split()[0])
rng = np.random.default_rng(seed=42)
phi_w = "".join(map(str, rng.integers(low=1, high=3, size=chr14_len)))

diem(PhiW=phi_w, CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"], 
     CompartmentPloidies=[2], datapath="diem_files/diem_input", 
     outputPath="diempy_test", ChosenInds=list(range(13)))

I get a NumPy array creation error (np 1.24):

Traceback (ignore the VSCode links - I'm too lazy to crop them out):

RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/s2341012/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/Users/s2341012/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
           ^^^^^^^^^^^^^^^^
  File "/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py", line 205, in GetI4ofOneCompartment
    def GetI4ofOneCompartment(args): return GetI4ofOneCompartment2args(*args)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py", line 202, in GetI4ofOneCompartment2args
    return np.array([csStateCount(Ind) for Ind in Inds], dtype=np.uint32)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (14,) + inhomogeneous part.
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[22], [line 1](vscode-notebook-cell:?execution_count=22&line=1)
----> [1](vscode-notebook-cell:?execution_count=22&line=1) diem(PhiW=phi_w, CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"], 
      [2](vscode-notebook-cell:?execution_count=22&line=2)      CompartmentPloidies=[2], datapath="diem_files/diem_input", 
      [3](vscode-notebook-cell:?execution_count=22&line=3)      outputPath="diempy_test", ChosenInds=list(range(13)))

File [~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:459](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:459), in diem(PhiW, CompartmentNames, CompartmentPloidies, datapath, outputPath, ChosenInds, diemMaxIterations, Epsilon, verbose, processes)
    [457](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:457) # ACTION : Measure initial state
    [458](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:458)     vprint("Starting big state counts..")
--> [459](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:459)     I4compartments = AbsoluteTiming(GetI4ofCompartments, PhiW)
    [460](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:460) #    I4compartments = AbsoluteTimingNoArg(GetI4ofCompartments);
    [461](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:461)     vprint("I4 : ", I4compartments[0], " seconds.")

File [~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:311](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:311), in AbsoluteTiming(f, arg)
    [309](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:309) def AbsoluteTiming(f, arg):
    [310](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:310)     start_time = time.time()
--> [311](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:311)     ans = f(arg)
    [312](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:312)     duration = time.time() - start_time
    [313](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:313)     return [duration, ans]

File [~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:388](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:388), in diem.<locals>.GetI4ofCompartments(markerLabelsForCompartments)
    [384](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:384) def GetI4ofCompartments(markerLabelsForCompartments):
    [385](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:385)     #    def GetI4ofCompartments():
    [386](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:386)     compInputs = list(
    [387](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:387)         zip(CompartmentDatapaths, markerLabelsForCompartments))
--> [388](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:388)     return np.array(ParallelMap(GetI4ofOneCompartment, compInputs))

File [~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:369](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:369), in diem.<locals>.ParallelMap(f, arglist)
    [367](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:367) def ParallelMap(f, arglist):
    [368](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:368)     with multiprocessing.Pool(processes=processes) as pool:
--> [369](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:369)         ans = pool.map(f, arglist)
    [370](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:370)         while len(pool._cache) > 0:
    [371](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:371)             sleep(0.1)

File [~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:367](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:367), in Pool.map(self, func, iterable, chunksize)
    [362](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:362) def map(self, func, iterable, chunksize=None):
    [363](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:363)     '''
    [364](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:364)     Apply `func` to each element in `iterable`, collecting the results
    [365](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:365)     in a list that is returned.
    [366](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:366)     '''
--> [367](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:367)     return self._map_async(func, iterable, mapstar, chunksize).get()

File [~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:774](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:774), in ApplyResult.get(self, timeout)
    [772](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:772)     return self._value
    [773](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:773) else:
--> [774](https://file+.vscode-resource.vscode-cdn.net/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:774)     raise self._value

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (14,) + inhomogeneous part.

No idea what the problem is, so I'll leave this issue open in case you want to investigate further.

Will let you know how the bigger test goes - thanks again! Simon

StuartJEBaird commented 4 months ago

Hi Simon!

On Thu, 2 May 2024 at 13:52, simonharnqvist @.***> wrote:

Hi Stuart,

I was being very silly - of course it doesn't use more than one core when I've only provided one input file. As far as I can tell, it works on MacOS

  • will run a proper test overnight.

I will tell Natalia - she will be pleased.

With diempy, I'm getting an error - I'll stick the traceback below for your information, but I'm happy enough with diemR now.

I would be happy if you try the following for diempy

It seems you called diempy like this:

diem(PhiW=phi_w, CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"], CompartmentPloidies=[2], datapath="diem_files/diem_input", outputPath="diempy_test", ChosenInds=list(range(13)))

But CompartmentPloidies is "an N[compartments] x N[individuals] matrix" (see passim), so (as the compatement name list has only one entry) try calling it something like this:

diem(PhiW=phi_w, CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"], CompartmentPloidies=[ [2 for number in range(13)] ], datapath="diem_files/diem_input", outputPath="diempy_test", ChosenInds=list(range(13)))

such that compartmentPloidies also only has one entry: a list of ploidies for all 13 individuals (for the one compartment in the list).

At some point I will put a wrapper round the core function to catch stuff like this...

Bestest

S

chr14_len = int(subprocess.check_output(["wc", "-l", "diem_files/diem_input/brenthis_ino.SP_BI_364.chromosome_14.diem.txt"]).split()[0])rng = np.random.default_rng(seed=42)phi_w = "".join(map(str, rng.integers(low=1, high=3, size=chr14_len))) diem(PhiW=phi_w, CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"], CompartmentPloidies=[2], datapath="diem_files/diem_input", outputPath="diempy_test", ChosenInds=list(range(13)))

I get a NumPy array creation error (np 1.24):

Traceback (ignore the VSCode links - I'm too lazy to crop them out):

RemoteTraceback Traceback (most recent call last) RemoteTraceback: """ Traceback (most recent call last): File "/Users/s2341012/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, *kwds)) ^^^^^^^^^^^^^^^^^^^ File "/Users/s2341012/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar return list(map(args)) ^^^^^^^^^^^^^^^^ File "/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py", line 205, in GetI4ofOneCompartment def GetI4ofOneCompartment(args): return GetI4ofOneCompartment2args(*args) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/s2341012/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py", line 202, in GetI4ofOneCompartment2args return np.array([csStateCount(Ind) for Ind in Inds], dtype=np.uint32) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (14,) + inhomogeneous part. """

The above exception was the direct cause of the following exception:

ValueError Traceback (most recent call last) Cell In[22], line 1 ----> 1 diem(PhiW=phi_w, CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"], 2 CompartmentPloidies=[2], datapath="diem_files/diem_input", 3 outputPath="diempy_test", ChosenInds=list(range(13)))

File ~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:459, in diem(PhiW, CompartmentNames, CompartmentPloidies, datapath, outputPath, ChosenInds, diemMaxIterations, Epsilon, verbose, processes) 457 # ACTION : Measure initial state 458 vprint("Starting big state counts..") --> 459 I4compartments = AbsoluteTiming(GetI4ofCompartments, PhiW) 460 # I4compartments = AbsoluteTimingNoArg(GetI4ofCompartments); 461 vprint("I4 : ", I4compartments[0], " seconds.")

File ~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:311, in AbsoluteTiming(f, arg) 309 def AbsoluteTiming(f, arg): 310 start_time = time.time() --> 311 ans = f(arg) 312 duration = time.time() - start_time 313 return [duration, ans]

File ~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:388, in diem..GetI4ofCompartments(markerLabelsForCompartments) 384 def GetI4ofCompartments(markerLabelsForCompartments): 385 # def GetI4ofCompartments(): 386 compInputs = list( 387 zip(CompartmentDatapaths, markerLabelsForCompartments)) --> 388 return np.array(ParallelMap(GetI4ofOneCompartment, compInputs))

File ~/Library/CloudStorage/Dropbox/DiemAnalysis/diem/diempy/diempy_core/diem.py:369, in diem..ParallelMap(f, arglist) 367 def ParallelMap(f, arglist): 368 with multiprocessing.Pool(processes=processes) as pool: --> 369 ans = pool.map(f, arglist) 370 while len(pool._cache) > 0: 371 sleep(0.1)

File ~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:367, in Pool.map(self, func, iterable, chunksize) 362 def map(self, func, iterable, chunksize=None): 363 ''' 364 Apply func to each element in iterable, collecting the results 365 in a list that is returned. 366 ''' --> 367 return self._map_async(func, iterable, mapstar, chunksize).get()

File ~/mambaforge/envs/diem/lib/python3.11/multiprocessing/pool.py:774, in ApplyResult.get(self, timeout) 772 return self._value 773 else: --> 774 raise self._value

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (14,) + inhomogeneous part.

No idea what the problem is, so I'll leave this issue open in case you want to investigate further.

Will let you know how the bigger test goes - thanks again! Simon

— Reply to this email directly, view it on GitHub https://github.com/StuartJEBaird/diem/issues/5#issuecomment-2090309570, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDCV57KNAOEXBN6FO4KDELZAISGFAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOJQGMYDSNJXGA . You are receiving this because you were mentioned.Message ID: @.***>

simonharnqvist commented 4 months ago

Hi

Can confirm: diemR now works as expected on full genome. Regarding output: is there a better option than simply capturing stdout (which I may have forgotten to do...) - I don't see an output path option unlike in the Python API?

Stuart: unfortunately I still get the same error. If that works for you: which version of NumPy are you using?

StuartJEBaird commented 4 months ago

Hi simon,

As Natalia is on holiday I will try to answer for both R and Python:

On Fri, 3 May 2024 at 15:33, simonharnqvist @.***> wrote:

Hi

Can confirm: diemR now works as expected on full genome. Regarding output: is there a better option than simply capturing stdout (which I may have forgotten to do...) - I don't see an output path option unlike in the Python API?

I believe diemr may have tried to create a new output directory in the 'working directory'.

Try searching for filenames like "MarkerDiagnosticsWithOptimalPolarities.txt "

Stuart: unfortunately I still get the same error. If that works for you: which version of NumPy are you using?

Precisely the same error... interesting.

Perhaps just send me your python script for calling diempy - I believe I have the data for chr14 already? Probably easier to work it out from this end.

Bestest

Stuart

— Reply to this email directly, view it on GitHub https://github.com/StuartJEBaird/diem/issues/5#issuecomment-2093033086, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDCV5YSE6FVEG2ZWDKLQW3ZAOG3XAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOJTGAZTGMBYGY . You are receiving this because you were mentioned.Message ID: @.***>

simonharnqvist commented 4 months ago

Hi,

diemR: Hmm - I don't find any files with optimal polarities. Let's wait until Natalia gets back, there's no rush to get this working.

diempy: This is using diem.py from this repo with one modification of L25 - I've commented out:

np.warnings.filterwarnings('error', category=np.VisibleDeprecationWarning)

Because this throws an error for me - which is why I suspect I might be using an incompatible version of NumPy.

Here's the script:

from diem.diempy.diempy_core.diem import diem
import subprocess
import numpy as np

chr14_len = int(subprocess.check_output(["wc", "-l", "diem_files/diem_input/brenthis_ino.SP_BI_364.chromosome_14.diem.txt"]).split()[0])
rng = np.random.default_rng(seed=42)
phi_w = "".join(map(str, rng.integers(low=1, high=3, size=chr14_len)))

diem(PhiW=phi_w,
CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"],
     CompartmentPloidies=[  [2]*13 ],
datapath="diem_files/diem_input",
     outputPath="diempy_test", ChosenInds=list(range(13)))
StuartJEBaird commented 3 months ago

Hi Simon!

Natalia is back - and straight into grant assessment. She did have a suggestion however:

On Mon, 6 May 2024 at 11:15, simonharnqvist @.***> wrote:

Hi,

diemR: Hmm - I don't find any files with optimal polarities. Let's wait until Natalia gets back, there's no rush to get this working.

Natalia: "into the working directory. getwd()"

Personally, I think you might have thought of that... is it possible there is an interaction term with running diemr on a cluster/cloud? (eg auto-trash of working area once job is complete?).

diempy: This is using diem.py from this repo with one modification of L25

  • I've commented out:

np.warnings.filterwarnings('error', category=np.VisibleDeprecationWarning)

Because this throws an error for me - which is why I suspect I might be using an incompatible version of NumPy.

At that line I cast the version warning into an error

This gives me some look-ahead regarding version updates - instead of users getting a warning they can ignore, they get an apparent error that they report to me. I can then tell them to comment out L25... and off they go!

Unfortunately this means the version issue is very likely ignorable in your case, and the real error is elsewhere.

Will look into it, and thank you for the help tracing the issue!

S

Here's the script:

from diem.diempy.diempy_core.diem import diemimport subprocessimport numpy as np chr14_len = int(subprocess.check_output(["wc", "-l", "diem_files/diem_input/brenthis_ino.SP_BI_364.chromosome_14.diem.txt"]).split()[0])rng = np.random.default_rng(seed=42)phi_w = "".join(map(str, rng.integers(low=1, high=3, size=chr14_len))) diem(PhiW=phi_w,CompartmentNames=["brenthis_ino.SP_BI_364.chromosome_14.diem.txt"], CompartmentPloidies=[ [2]*13 ],datapath="diem_files/diem_input", outputPath="diempy_test", ChosenInds=list(range(13)))

— Reply to this email directly, view it on GitHub https://github.com/StuartJEBaird/diem/issues/5#issuecomment-2095528242, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDCV55K5CQAW4F47N4345DZA5C3TAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOJVGUZDQMRUGI . You are receiving this because you were mentioned.Message ID: @.***>

simonharnqvist commented 3 months ago

Thanks Stuart and Natalia - indeed, nothing in the working directory; I'm running this test locally on my Mac, so there's no scratch that might get wiped. (But it also means I have a second platform to try it on - I won't run the rest of my species locally anyway)

I'm running it again - maybe the process threw an error before finishing that I didn't notice. Will let you know once that finishes, including a full list of output files.

Simon

simonharnqvist commented 3 months ago

Hi again,

Reporting back from the re-run. As a reminder, this is the call:

library(diemr)

input <- Sys.glob("diem_files/diem_input/*.chromosome_*.diem.txt")

diem(
    files = input,
    ploidy = list(rep(2, 13)),
    markerPolarity = FALSE,
    ChosenInds = 1:13,
    nCores = 6
)

Diem wrote 12 directories to my workdir, listed below. I'm using invalid/pseudo-regex to represent multiple entries.


modelDiagnostics[1-11]
│   OriginalHI.txt
│   RescaledHItobetaOfRescaledHI.pdf   
|   SortedRescaledHybridIndex.pdf
│   WarpSwitch.pdf
|   WarpSwitch.txt

and

likelihood
|   MarkerDiagnostics[0-9]-[0-9].txt
|   MarkersWithChangedPolarity[1-14]

The end of the standard output looks like this:

[99865] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[99877] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
[99889]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[99901]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
[99913] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
[99925] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
[99937]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
[99949] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[99961] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[99973] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
[99985]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
[99997] FALSE FALSE FALSE
 [ reached getOption("max.print") -- omitted 947174 entries ]

I don't see any optimal polarities anywhere - but maybe it's still hiding somewhere?

Simon

StuartJEBaird commented 3 months ago

Hi Simon,

Thanks for this very clear summary.

Have a look in the MarkerDiagnostics files:

If there are three columns, then they are the {OptimalPolarity,DiagnosticIndex,Support} tuples for each of your sites.

I have no idea why R is trying to print a million booleans to stdout... Natalia?

More comments/queries below: BOTH: Please correct anything I have wrong...

Best

Stuart

On Thu, 23 May 2024 at 11:40, simonharnqvist @.***> wrote:

Hi again,

Reporting back from the re-run. As a reminder, this is the call:

library(diemr) input <- Sys.glob("diem_files/dieminput/*.chromosome*.diem.txt")

There are multiple compartments of diem-format data: One for each chromosome. (The number of compartments is unclear - but it's a butterfly so around 32)

diem( files = input, ploidy = list(rep(2, 13)), markerPolarity = FALSE, ChosenInds = 1:13, nCores = 6 )

You have chosen to let diemr construct a null polarisation (which it will save for you, Natalia?)

You have specified the ploidy for only one compartment of data: As with diempy,diemmca, the natural ploidy spec is a list of lists, one for each compartment. (Perhaps diemr catches this type mis-match and constructs a list of lists by assuming all compartments have the same ploidy spec... Natalia?)

Diem wrote 12 directories to my workdir, listed below. I'm using invalid/pseudo-regex to represent multiple entries.

12 surprises me. I would expect 1 summary directory (diemr modelDiagnostics) And one directory to gather outputs for each of the chromosomes (diemr likelihood).

Check the time stamps of the summary directories... diemr may be avoiding overwriting older directories, which would suggest this is the 11th time you have tried to get this working....

modelDiagnostics[1-11] │ OriginalHI.txt │ RescaledHItobetaOfRescaledHI.pdf | SortedRescaledHybridIndex.pdf │ WarpSwitch.pdf | WarpSwitch.txt

and

likelihood | MarkerDiagnostics[0-9]-[0-9].txt | MarkersWithChangedPolarity[1-14]

The end of the standard output looks like this:

[99865] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE [99877] FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE [99889] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [99901] TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE [99913] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE [99925] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE [99937] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE [99949] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [99961] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [99973] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE [99985] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE [99997] FALSE FALSE FALSE [ reached getOption("max.print") -- omitted 947174 entries ]

I don't see any optimal polarities anywhere - but maybe it's still hiding somewhere?

Simon

— Reply to this email directly, view it on GitHub https://github.com/StuartJEBaird/diem/issues/5#issuecomment-2126672271, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDCV57QCVQ5ARLAZPXSZQTZDW2SBAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRWGY3TEMRXGE . You are receiving this because you were mentioned.Message ID: @.***>

StuartJEBaird commented 3 months ago

Dear Simon,

Can you wait until Tuesday? I’m horribly overwhelmed with project evaluations after my holidays. So now, just a quick reply:

To make sure that I provide ploidies for all compartments, I use this code:

subory = dir("Data", full.names = TRUE) # input files in diem format

inds = c(1:13) # indices of individuals to use

ploidies = rep(list(rep(2, length(inds))), length(subory)) # diploid for all compartments. Must be modified if a compartment is a sex chromosome and individuals have known sex.

CheckDiemFormat(subory, inds, ploidies)

res = diem(files = subory, ChosenInds = inds, ploidy = ploidies, verbose = TRUE)

The polarities are in res$newPolarity or saved to diagnostics/MarkersDiagnosticsWithOptimalPolarities in the working directory with the polarities in the first column as Stuart says.

this code calculates how many markers there are per compartment.

compartment.sizes = readLines("diagnostics/NullMarkerPolarities.txt")[-1] compartment.sizes = strsplit(compartment.sizes, split = " ") compartment.sizes = lapply(compartment.sizes, as.logical) compartment.sizes = unlist(lapply(compartment.sizes, length))

import genotypes

WILL FAIL DUE TO MEMORY FOR LARGE DATASETS!

gens = Reduce(cbind, lapply(1:length(subory), FUN = function(x) importPolarized(file = subory[x], changePolarity = res$markerPolarity[(sum(c(1, compartment.sizes)[1:x])):(sum(compartment.sizes[1:x]))], ChosenInds = inds) ))

The multiple directories are Stuart’s old requirement that we report the results of each iteration (also in res$changedPolarities).

I hope this helps until Tuesday.

Best, Natalia

On 23 May 2024, at 12:32, StuartJEBaird @.***> wrote:

Hi Simon,

Thanks for this very clear summary.

Have a look in the MarkerDiagnostics files:

If there are three columns, then they are the {OptimalPolarity,DiagnosticIndex,Support} tuples for each of your sites.

I have no idea why R is trying to print a million booleans to stdout... Natalia?

More comments/queries below: BOTH: Please correct anything I have wrong...

Best

Stuart

On Thu, 23 May 2024 at 11:40, simonharnqvist @. @.>> wrote:

Hi again,

Reporting back from the re-run. As a reminder, this is the call:

library(diemr)

input <- Sys.glob("diem_files/dieminput/*.chromosome*.diem.txt") There are multiple compartments of diem-format data: One for each chromosome. (The number of compartments is unclear - but it's a butterfly so around 32) diem( files = input, ploidy = list(rep(2, 13)), markerPolarity = FALSE, ChosenInds = 1:13, nCores = 6 ) You have chosen to let diemr construct a null polarisation (which it will save for you, Natalia?)

You have specified the ploidy for only one compartment of data: As with diempy,diemmca, the natural ploidy spec is a list of lists, one for each compartment. (Perhaps diemr catches this type mis-match and constructs a list of lists by assuming all compartments have the same ploidy spec... Natalia?) Diem wrote 12 directories to my workdir, listed below. I'm using invalid/pseudo-regex to represent multiple entries.

12 surprises me. I would expect 1 summary directory (diemr modelDiagnostics) And one directory to gather outputs for each of the chromosomes (diemr likelihood).

Check the time stamps of the summary directories... diemr may be avoiding overwriting older directories, which would suggest this is the 11th time you have tried to get this working....

modelDiagnostics[1-11] │ OriginalHI.txt │ RescaledHItobetaOfRescaledHI.pdf
| SortedRescaledHybridIndex.pdf │ WarpSwitch.pdf | WarpSwitch.txt and

likelihood | MarkerDiagnostics[0-9]-[0-9].txt | MarkersWithChangedPolarity[1-14] The end of the standard output looks like this:

[99865] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE [99877] FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE [99889] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE [99901] TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE TRUE [99913] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE [99925] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE [99937] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE [99949] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [99961] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [99973] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE [99985] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE [99997] FALSE FALSE FALSE [ reached getOption("max.print") -- omitted 947174 entries ] I don't see any optimal polarities anywhere - but maybe it's still hiding somewhere?

Simon

— Reply to this email directly, view it on GitHub https://github.com/StuartJEBaird/diem/issues/5#issuecomment-2126672271, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDCV57QCVQ5ARLAZPXSZQTZDW2SBAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRWGY3TEMRXGE. You are receiving this because you were mentioned.

simonharnqvist commented 3 months ago

Thanks both - this can absolutely wait until next week. I've corrected the ploidy specification and am running again overnight. I'll keep an eye out for a 'diagnostics' directory in the workdir.

Simon

simonharnqvist commented 3 months ago

Can confirm - the verbose=TRUE made the difference. I now have a 'diagnostics' folder - diemR has worked!

@StuartJEBaird - is there a guide/example for how go from the list of indices to the pretty Diem plots?

StuartJEBaird commented 3 months ago

Hi Simon,

(Konrad - if you want to play with this stuff in Mathematica just say and I will send my input files for the cdf below)

Simon: I don't think it was the verbose=TRUE .... more likely the incorrect ploidy spec (feel free to prove me wrong).

But in any case - diemR has worked.

You now have the polarisation stats for very many brenthis sites, the majority of which will be uninteresting. The interesting ones have high diagnostic index (DI).

In which environment will you be working now?

The first thing to do is examine the distribution of DIs over all sites (for all chromosomes).

Much as one would only be interested in ~1% of the genome, if looking for differences between humans and chimps, one is only interested in a (small) percentage of the genome when looking for differences between brenthises. But that percentage is a priori unknown - hence inspect the DI distribution. Here is the distribution for brenthis Chr1:

[image: image.png] When I worked up your chr1 data I arbitrarily decided to work with all sites DI>-30...thats about 46% of sites (which is way more than most systems). (Not completely arbitrary - RH peak is sites with fixed differences, DI>-30 includes a smaller peak... ~sites with introgression).

Natalia has given you her R import code (see passim), which imports and polarises all your original site data. If you work in R, I suggest you add to that code a filter so that you only import DI>SimonsChoice, and then use the diemr visualiser. Be aware: You have way more highDI sites than anyone else, so you may discover 'issues'.

I attach my Mathematica import and visualisation code in a cdf file that can be opened using Wolfram's free MathPlayer. I have added dummyBrenthisNames and a dummyBrenthisBED to show where/how sampleIDs and a bed file for diem-encoded sites are used.

Not sure the BEDfile stuff is yet available in R...but it is important: Check out 'chr'5 in the cdf visualisation - there is a proximal region with a desert of diagnostic sites. One explanation is a selective sweep across the brenthis barrier, such that there is a large region now identical across the barrier, and therefore unpolarisable. I suggest that because of the perturbation in the remaining polarisable sites around the desert. Looks to me like a burst DMI - something I have been discussing with Pierre Nahoud in his data... but I digress.

If you want to code up this post-diem stuff in Python, that would be great! We could add your code to the repository (along with Sam's vcf2diem) Along those lines, we are just about to add the circular diem plot code in Python.

I should say, regarding potential brenthis-specific issues: You have WAY more highDI sites than anyone else...we may need to (lossless) runlength compress/encode your data for decent visualisations...this is one of two tricks we already use to plot circular whole genome diem results.

Best

Stuart

On Fri, 24 May 2024 at 12:24, simonharnqvist @.***> wrote:

Can confirm - the verbose=TRUE made the difference. I now have a 'diagnostics' folder - diemR has worked!

@StuartJEBaird https://github.com/StuartJEBaird - is there a guide/example for how go from the list of indices to the pretty Diem plots?

— Reply to this email directly, view it on GitHub https://github.com/StuartJEBaird/diem/issues/5#issuecomment-2129186878, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDCV56574MPXBUBJ2D4GS3ZD4INVAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRZGE4DMOBXHA . You are receiving this because you were mentioned.Message ID: @.***>

StuartJEBaird commented 3 months ago

Thanks for all your help Stuart!

The high density of highDI sites will be true for all butterfly species pairs so run-length compression will be helpful in general. Awesome that the circular plotting code will be available in python soon.

@SimonH: totally up to you whether you want to work in R, python or mathematica. The advantage of going for python is that it may the best environment for coding up some of the post-diem filtering stuff in a generally useful way. One thing that is not clear to me is how necessary/helpful kernel smoothing is. I assume that this is only possible in mathematica at the moment?

@ Stuart: I will try my best to explain your tract-length distribution analysis to folk here at the labmeeting next wednesday (12.30 your time). If you have time to join we could chat about strategies for Simon's post-diem analyses.

Best wishes, Konrad


From: StuartJEBaird @.> Sent: 24 May 2024 13:55 To: StuartJEBaird/diem @.> Cc: Konrad Lohse @.***> Subject: Re: [StuartJEBaird/diem] Running diem on whole genome (Issue #5)

This email was sent to you by someone outside the University. You should only click on links or attachments if you are certain that the email is genuine and the content is safe. Hi Simon,

(Konrad - if you want to play with this stuff in Mathematica just say and I will send my input files for the cdf below)

Simon: I don't think it was the verbose=TRUE .... more likely the incorrect ploidy spec (feel free to prove me wrong).

But in any case - diemR has worked.

You now have the polarisation stats for very many brenthis sites, the majority of which will be uninteresting. The interesting ones have high diagnostic index (DI).

In which environment will you be working now?

The first thing to do is examine the distribution of DIs over all sites (for all chromosomes).

Much as one would only be interested in ~1% of the genome, if looking for differences between humans and chimps, one is only interested in a (small) percentage of the genome when looking for differences between brenthises. But that percentage is a priori unknown - hence inspect the DI distribution. Here is the distribution for brenthis Chr1:

[image.png] When I worked up your chr1 data I arbitrarily decided to work with all sites DI>-30...thats about 46% of sites (which is way more than most systems). (Not completely arbitrary - RH peak is sites with fixed differences, DI>-30 includes a smaller peak... ~sites with introgression).

Natalia has given you her R import code (see passim), which imports and polarises all your original site data. If you work in R, I suggest you add to that code a filter so that you only import DI>SimonsChoice, and then use the diemr visualiser. Be aware: You have way more highDI sites than anyone else, so you may discover 'issues'.

I attach my Mathematica import and visualisation code in a cdf file that can be opened using Wolfram's free MathPlayer. I have added dummyBrenthisNames and a dummyBrenthisBED to show where/how sampleIDs and a bed file for diem-encoded sites are used.

Not sure the BEDfile stuff is yet available in R...but it is important: Check out 'chr'5 in the cdf visualisation - there is a proximal region with a desert of diagnostic sites. One explanation is a selective sweep across the brenthis barrier, such that there is a large region now identical across the barrier, and therefore unpolarisable. I suggest that because of the perturbation in the remaining polarisable sites around the desert. Looks to me like a burst DMI - something I have been discussing with Pierre Nahoud in his data... but I digress.

If you want to code up this post-diem stuff in Python, that would be great! We could add your code to the repository (along with Sam's vcf2diem) Along those lines, we are just about to add the circular diem plot code in Python.

I should say, regarding potential brenthis-specific issues: You have WAY more highDI sites than anyone else...we may need to (lossless) runlength compress/encode your data for decent visualisations...this is one of two tricks we already use to plot circular whole genome diem results.

Best

Stuart

On Fri, 24 May 2024 at 12:24, simonharnqvist @.**@.>> wrote:

Can confirm - the verbose=TRUE made the difference. I now have a 'diagnostics' folder - diemR has worked!

@StuartJEBairdhttps://github.com/StuartJEBaird - is there a guide/example for how go from the list of indices to the pretty Diem plots?

— Reply to this email directly, view it on GitHubhttps://github.com/StuartJEBaird/diem/issues/5#issuecomment-2129186878, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABDCV56574MPXBUBJ2D4GS3ZD4INVAVCNFSM6AAAAABG2SBLTOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRZGE4DMOBXHA. You are receiving this because you were mentioned.Message ID: @.***>

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. Is e buidheann carthannais a th’ ann an Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.

nmartinkova commented 3 months ago

Hi Simon,

Thank you for your patience. I have now set the outputs in such a way that regardless of the value in the verbose argument in diem, a file MarkerDiagnosticsWithOptimalPolarities.txt will always be created. The new vignette Understanding genome polarisation output files will be available from diemr 1.3.

Best, Natalia