Tchanders / InformationMeasures.jl

Entropy, mutual information and higher order measures from information theory, with various estimators and discretisation methods.

GRN inference error when getting nodes. #29

Closed · koenvandenberge closed this issue 4 years ago

koenvandenberge commented 4 years ago

Hi,

I'm trying to estimate a GRN on a dataset matrix that can be found here.

However, I get an error in the first step, also when I use uniform widths as discretizer. Any ideas on how to fix this?

# Include packages

using NetworkInference
using LightGraphs
using GraphPlot

dataset_name = string("ExpressionData.csv")
algorithm = PIDCNetworkInference()
threshold = 0.15
@time genes = get_nodes(dataset_name, discretizer = "uniform_width");

ERROR: ArgumentError: collection must be non-empty
Stacktrace:
 [1] _extrema_itr(::typeof(identity), ::Array{Float64,1}) at ./operators.jl:472
 [2] _extrema_dims at ./multidimensional.jl:1601 [inlined]
 [3] #extrema#434 at ./multidimensional.jl:1588 [inlined]
 [4] extrema at ./multidimensional.jl:1588 [inlined]
 [5] get_bin_ids!(::Array{Float64,1}, ::String, ::Int64, ::Array{Int64,1}) at /Users/koenvandenberge/.julia/packages/InformationMeasures/fdfJk/src/Discretization.jl:107
 [6] Node(::Array{Any,2}, ::String, ::String, ::Int64) at /Users/koenvandenberge/.julia/packages/NetworkInference/z8pnG/src/common.jl:32
 [7] get_nodes(::String; delim::Bool, discretizer::String, estimator::String, number_of_bins::Int64) at /Users/koenvandenberge/.julia/packages/NetworkInference/z8pnG/src/infer_network.jl:35
 [8] get_nodes at /Users/koenvandenberge/.julia/packages/NetworkInference/z8pnG/src/infer_network.jl:26 [inlined]
 [9] top-level scope at ./util.jl:175

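The first error is `extrema` being called on an empty vector inside `get_bin_ids!`. A minimal standalone reproduction (not the package's code) produces the same `ArgumentError`:

```julia
# `extrema` requires at least one element; an empty expression vector
# triggers the same ArgumentError seen in the stack trace above.
empty_values = Float64[]
err = try
    extrema(empty_values)
    nothing
catch e
    e
end
err isa ArgumentError  # true
```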
Note that reading the dataset using CSV works:

julia> dataset = CSV.read(dataset_name)
19×2001 DataFrames.DataFrame. Omitted printing of 1993 columns
│ Row │ Column1 │ E37_5_927   │ E42_7_69    │ E20_7_209   │ E70_2_163  │ E107_6_328 │ E131_7_61  │ E135_3_524 │
│     │ String  │ Float64     │ Float64     │ Float64     │ Float64    │ Float64    │ Float64    │ Float64    │
├─────┼─────────┼─────────────┼─────────────┼─────────────┼────────────┼────────────┼────────────┼────────────┤
│ 1   │ DMRT1   │ 0.00239255  │ 0.0228187   │ 1.7351      │ 1.15105    │ 1.89054    │ 0.00972208 │ 2.0532     │
│ 2   │ FGF9    │ 0.0188525   │ 0.0372795   │ 1.98033     │ 4.01631e-5 │ 1.99057    │ 0.0488233  │ 0.114901   │
│ 3   │ RSPO1   │ 2.20423     │ 1.69426     │ 0.0148483   │ 1.31783    │ 0.00808689 │ 2.02124    │ 0.0190963  │
│ 4   │ DHH     │ 0.000755998 │ 0.0187628   │ 1.75397     │ 0.010211   │ 1.2898     │ 0.0404907  │ 0.017903   │
│ 5   │ CTNNB1  │ 2.77642     │ 2.05514     │ 0.00463818  │ 0.511992   │ 0.004185   │ 2.0759     │ 0.0189984  │
│ 6   │ PGD2    │ 0.00813081  │ 0.0119802   │ 2.12718     │ 0.00934665 │ 2.15162    │ 0.004154   │ 0.0196394  │
│ 7   │ WT1mKTS │ 2.19433     │ 1.3142      │ 1.99798     │ 1.64531    │ 1.7993     │ 2.43534    │ 1.92463    │
⋮
│ 12  │ AMH     │ 0.000717303 │ 0.0275178   │ 1.60484     │ 0.152546   │ 1.29728    │ 0.00579285 │ 0.0252601  │
│ 13  │ NR0B1   │ 2.39157     │ 1.60968     │ 0.0143284   │ 2.28907    │ 0.333701   │ 1.50397    │ 2.16528    │
│ 14  │ NR5A1   │ 0.00334859  │ 0.0156287   │ 1.5703      │ 0.724982   │ 2.08801    │ 0.0162079  │ 1.12717    │
│ 15  │ WT1pKTS │ 0.272286    │ 0.0113008   │ 2.5884      │ 2.27792    │ 1.65995    │ 0.169417   │ 2.32935    │
│ 16  │ FOXL2   │ 1.4524      │ 1.62472     │ 0.000722346 │ 0.00980684 │ 0.00119802 │ 2.0794     │ 0.00196965 │
│ 17  │ UGR     │ 0.0465009   │ 0.000586337 │ 0.00702597  │ 0.0109601  │ 0.0931931  │ 0.00770349 │ 0.00754363 │
│ 18  │ SOX9    │ 0.0166821   │ 0.00586449  │ 2.19017     │ 0.228339   │ 1.24644    │ 0.0035871  │ 1.42638    │
│ 19  │ GATA4   │ 2.30659     │ 2.01715     │ 2.02201     │ 2.2878     │ 1.69821    │ 1.75037    │ 2.0366     │
koenvandenberge commented 4 years ago

This could be fixed using @time genes = get_nodes(dataset_name, delim=',', discretizer = "uniform_width");

However, using a bigger dataset the following error then pops up:

julia> @time genes = get_nodes(dataset_name, delim=',');
ERROR: ArgumentError: indexed assignment with a single value to many locations is not supported; perhaps use broadcasting `.=` instead?
Stacktrace:
 [1] setindex_shape_check(::Int64, ::Int64) at ./indices.jl:258
 [2] macro expansion at ./multidimensional.jl:779 [inlined]
 [3] _unsafe_setindex!(::IndexLinear, ::Array{Int64,1}, ::Int64, ::UnitRange{Int64}) at ./multidimensional.jl:774
 [4] _setindex! at ./multidimensional.jl:769 [inlined]
 [5] setindex! at ./abstractarray.jl:1073 [inlined]
 [6] get_bin_ids!(::Array{Float64,1}, ::String, ::Int64, ::Array{Int64,1}) at /Users/koenvandenberge/.julia/packages/InformationMeasures/fdfJk/src/Discretization.jl:111
 [7] Node(::Array{Any,2}, ::String, ::String, ::Int64) at /Users/koenvandenberge/.julia/packages/NetworkInference/z8pnG/src/common.jl:32
 [8] get_nodes(::String; delim::Char, discretizer::String, estimator::String, number_of_bins::Int64) at /Users/koenvandenberge/.julia/packages/NetworkInference/z8pnG/src/infer_network.jl:35
 [9] top-level scope at ./util.jl:175
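The second stack trace points at `Discretization.jl:111`, where a scalar is assigned to a range of indices. That pattern was deprecated in Julia 0.7 and later removed; the replacement is broadcasting assignment with `.=`. An illustrative snippet (not the package's exact code):

```julia
bin_ids = zeros(Int, 5)

# bin_ids[1:5] = 3       # errors on modern Julia:
#                        # "indexed assignment with a single value to many
#                        #  locations is not supported"
bin_ids[1:5] .= 3        # broadcasting assigns the scalar to every index

bin_ids                  # [3, 3, 3, 3, 3]
```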
Tchanders commented 4 years ago

Hi,

This could be fixed using @time genes = get_nodes(dataset_name, delim=',', discretizer = "uniform_width");

Glad you got this working - the delimiter defaults to tab.
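To illustrate the delimiter point: parsed with the default tab delimiter, a comma-separated line comes back as a single field, so no numeric expression values are extracted for any gene (an illustrative snippet, not the package's parsing code):

```julia
line = "DMRT1,0.00239255,0.0228187"

length(split(line, '\t'))  # 1 — the whole line is one "field"
length(split(line, ','))   # 3 — gene name plus two values
```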

For the error with the bigger dataset: how big is the dataset, and does the error still occur without @time?

koenvandenberge commented 4 years ago

It is fairly large though not huge, ~16K cells and as many genes. The error indeed occurs also without @time. I have shared the dataset here for reference.

koenvandenberge commented 4 years ago

Hi, just checking in to see whether you have any ideas on how this problem might be solved.

Thanks, Koen

Tchanders commented 4 years ago

@koenvandenberge Apologies for the delay, and thanks for reporting. The error was happening because of a deprecation in Julia 0.7.

After fixing, calling @time genes = get_nodes(dataset_name, delim=',', discretizer = "uniform_width"); with the linked dataset worked with no errors.

The size wasn't the problem; the error occurred only when the dataset contained genes whose values were all identical.
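For context on the all-identical-values case: with a constant gene, `extrema` returns equal min and max, so the uniform bin width collapses to zero. A hedged sketch of uniform-width binning with a guard for that case (`uniform_width_bins` is a hypothetical helper, not the InformationMeasures API):

```julia
# Sketch of uniform-width discretization with a constant-vector guard.
function uniform_width_bins(values::Vector{Float64}, nbins::Int)
    lo, hi = extrema(values)
    lo == hi && return ones(Int, length(values))  # constant gene: single bin
    width = (hi - lo) / nbins
    # Clamp so the maximum value lands in the last bin rather than nbins + 1.
    return [min(Int(fld(v - lo, width)) + 1, nbins) for v in values]
end

uniform_width_bins(fill(2.0, 5), 10)  # [1, 1, 1, 1, 1]
```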