aidenlab / straw

Extract data quickly from Juicebox via straw
MIT License
62 stars 36 forks source link

how is the output from straw formatted? #36

Closed hyl317 closed 5 years ago

hyl317 commented 5 years ago

My intention is to use straw to extract a contact matrix from .hic file so that $$i^{th}$$ row in the matrix stores the counts of the $$i^{th}$$ bin with all other bins. Trying to understand the format of the output of straw, I used the following command:

result = straw.straw("NONE", "https://hicfiles.s3.amazonaws.com/hiseq/gm12878/in-situ/combined.hic", 
                     str(args.chrom), str(args.chrom), "BP", args.bin)

When args.bin = 2.5Mb, I printed out the first few entries of result[0], result[1], and result[2], by inspecting the result and reading the source code of straw.py, I knew that result[0],result[2] are x,y-axis, respectively, and result[2] is the counts.

[0, 0, 2500000, 0, 2500000, 5000000, 0, 2500000, 5000000, 7500000, 0, 2500000, 5000000, 7500000, 10000000, 0, 2500000, 5000000, 7500000, 10000000]
[0, 2500000, 2500000, 5000000, 5000000, 5000000, 7500000, 7500000, 7500000, 7500000, 10000000, 10000000, 10000000, 10000000, 10000000, 12500000, 12500000, 12500000, 12500000, 12500000]
[1801463.0, 388561.0, 2243387.0, 140674.0, 667793.0, 2797495.0, 69644.0, 90349.0, 443364.0, 3702096.0, 63695.0, 68959.0, 155894.0, 591727.0, 3755324.0, 28924.0, 100082.0, 139273.0, 82155.0, 329323.0]

By tracing the x, and y axis, I guessed that result gives the counts data by iterating the upper triangular matrix column by column (as shown in the first half of the image); however, when I binned at 25kb, the results seem confusing. If results give the full upper triangular matrix, then result[2] should have #ofbins(1+#ofbins)/2 entries; this is the case when binSize=2.5Mb yet not true when binSize=25kb. I tried to trace the x,y labels and get a confusing line as shown in the second half of that image. So I wanna ask how is the upper triangular matrix coded in the straw output? Please correct me if I misunderstand how STRAW works since I haven't read the code very carefully. STRAW

srcoulombe commented 5 years ago

I don't think that the coordinate lists returned by straw.straw are meant to be in a specific order since it's not mentioned anywhere in the docs.

If you want to create an array from those lists, you could take advantage of the scipy.sparse.coo_matrix class and its .toarray( ) method (and then make sure your matrix is triangular/symmetric).

srcoulombe commented 5 years ago
import numpy as np
from scipy.sparse import coo_matrix
import straw

values, index_dim1, index_dim2 = result = straw.straw(
    "NONE", 
    "https://hicfiles.s3.amazonaws.com/hiseq/gm12878/in-situ/combined.hic", 
    str(args.chrom), 
    str(args.chrom), 
    "BP", 
    args.bin
)
as_coo = coo_matrix(
    ( 
        values, 
        (
            [ position//resolution for position in index_dim1 ],
            [ position//resolution for position in index_dim2 ]
        )
    )
)
as_array = as_coo.toarray()

# then handle the matrix format
if np.allclose(np.tril(as_array), as_array): print( "is lower triangular" )
elif np.allclose(np.triu(as_array), as_array): print( "is upper triangular" )
else: as_array = np.tril(as_array + as_array.T)
srcoulombe commented 5 years ago

My bad, those aren't in the right order, it should be index_dim1, index_dim2, values, see line 591:

    return [xActual, yActual, counts]
hyl317 commented 5 years ago

My bad, those aren't in the right order, it should be index_dim1, index_dim2, values, see line 591: return [xActual, yActual, counts]

Hi, thank you! This seems a clever way to get it around. Just one question. In case the as_array is neither upper nor lower triangular, in doing np.tril(as_array + as_array.T), I think we should subtract diagonal elements from one of the as_array so that counts on the diagonal will not be doubled?

srcoulombe commented 5 years ago

Yep!

hyl317 commented 5 years ago

I think I have another question about straw, though. When extracting info at higher resolution, for example, at 500kb, I noticed that, although the chromosome 1 has 499 bins, the length of "values" as in the following code is 104646, smaller than what I would expect from 1+2+...+498=124251. Does straw automatically ignore zeros? index_dim1, index_dim2, values, = result = straw.straw( "NONE", "https://hicfiles.s3.amazonaws.com/hiseq/gm12878/in-situ/combined.hic", str(args.chrom), str(args.chrom), "BP", args.bin )

nchernia commented 5 years ago

yes, straw ignores zeros and only outputs the non-sparse bins.

On Mon, Sep 2, 2019 at 10:59 AM Yilei Huang notifications@github.com wrote:

I think I have another question about straw, though. When extracting info at higher resolution, for example, at 500kb, I noticed that, although the chromosome 1 has 499 bins, the length of "values" as in the following code is 104646, smaller than what I would expect from 1+2+...+498=124251. Does straw automatically ignore zeros? index_dim1, index_dim2, values, = result = straw.straw( "NONE", " https://hicfiles.s3.amazonaws.com/hiseq/gm12878/in-situ/combined.hic", str(args.chrom), str(args.chrom), "BP", args.bin )

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/aidenlab/straw/issues/36?email_source=notifications&email_token=AAK2EW7JSN5PYYGBJ42AONLQHUS5ZA5CNFSM4IOJ3MA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5WBJLQ#issuecomment-527176878, or mute the thread https://github.com/notifications/unsubscribe-auth/AAK2EW44SBL5WZ2NQVCXYTTQHUS5ZANCNFSM4IOJ3MAQ .

-- Neva Cherniavsky Durand, Ph.D. Staff Scientist, Aiden Lab www.aidenlab.org