corehunter / corehunter3

Core Hunter 3: a flexible core subset selection tool
http://www.corehunter.org
Apache License 2.0

Generating and simultaneous use of multiple distance matrices with CoreHunter3 #114

Open anandksrao opened 10 months ago

anandksrao commented 10 months ago

I am very new to CoreHunter and to germplasm data analysis. For a new project, I want to use historical climate data in lieu of genotypic / allelic data with CoreHunter.

There are 26 climatic / bioclimatic data variables, plus elevation (which usually does not vary significantly over time). Two examples are maximum temperature and precipitation.

For each of these 26 variables, the data will be tabulated for ~500 plant accessions and for each month of the year. So for each variable, there will be ~500 rows and 12 columns (one per month).

At http://www.corehunter.org/measures it says "Alternatively, a precomputed distance matrix can be provided by the user."

How can I convert my 500 × 12 table into a distance matrix for use by CoreHunter? That would cover just 1 of the 26 climate variables. Would it therefore be possible to concatenate ALL these tables, one for each of the 26 variables, and then create a single distance matrix for combined use with CoreHunter3?

Climate values for successive months are related to those of the preceding months (isn't this known as autocorrelation?), and in some pairwise cases two climatic variables could be strongly (anti)correlated with one another, right? Would any of these statistical properties of my dataset require special methods to create or use the distance matrix (or matrices)?

My final goal, of course, is to use CoreHunter3 to leverage the climatic and bioclimatic data to subset the 500 accessions to 100 accessions (core set), and to further condense that to a set of 25 (mini core set), while maximizing diversity across these climatic variables.

Thank you in advance.

hermandebeukelaer commented 10 months ago

Hi @anandksrao,

It is not required to precompute a distance matrix to use Core Hunter for your data. Precomputing distances is mostly beneficial when you have a large number of variables (many more than the number of accessions), as it reduces the dimensionality of the data.
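For reference, if a precomputed matrix were wanted anyway, one common choice for purely numeric variables is Gower's distance: the absolute difference on each variable is divided by that variable's range, and the scaled differences are averaged. The following is only a minimal sketch of that formula; the function name `gower_numeric` is hypothetical, and skipping constant variables is an assumption of this sketch, not something taken from Core Hunter.

```python
def gower_numeric(rows):
    """Pairwise Gower distances for purely numeric data.

    rows: list of equal-length lists, one list per accession.
    Returns a symmetric n x n matrix (list of lists) with values in [0, 1].
    """
    n = len(rows)
    n_vars = len(rows[0])

    # Range of each variable, used to scale absolute differences into [0, 1].
    ranges = []
    for j in range(n_vars):
        column = [row[j] for row in rows]
        ranges.append(max(column) - min(column))

    dist = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            total = 0.0
            used = 0
            for j in range(n_vars):
                if ranges[j] == 0:
                    # Assumption of this sketch: a constant variable is skipped.
                    continue
                total += abs(rows[a][j] - rows[b][j]) / ranges[j]
                used += 1
            d = total / used if used else 0.0
            dist[a][b] = dist[b][a] = d
    return dist


if __name__ == "__main__":
    # Three toy accessions, two variables each.
    for line in gower_numeric([[0, 10], [5, 10], [10, 30]]):
        print(line)
```

Because every variable is scaled by its own range, variables measured in different units (for example degrees and millimetres) contribute on a comparable scale.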

I suggest you model your data directly in Core Hunter as phenotypic trait data. You can simply append all columns of the 26 variables into a single file and read this with Core Hunter as phenotypic trait data.
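Appending the columns could be sketched as follows. This is only an illustration under stated assumptions: the per-variable tables are assumed to be already loaded as dictionaries keyed by accession ID, the helper name `concat_variables` is hypothetical, and the `variable.Month` column naming is just one way to keep the column names unique.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def concat_variables(tables):
    """Merge per-variable monthly tables into one wide table.

    tables: {variable_name: {accession_id: [12 monthly values]}}
    Returns (header, rows); only accessions present in every table are kept.
    """
    # Keep accessions that have data for all variables.
    common_ids = sorted(set.intersection(*(set(t) for t in tables.values())))

    # One uniquely named column per (variable, month) pair.
    header = ["ID"] + [f"{var}.{month}" for var in tables for month in MONTHS]

    rows = []
    for acc in common_ids:
        row = [acc]
        for var in tables:
            values = tables[var][acc]
            if len(values) != len(MONTHS):
                raise ValueError(f"{var}/{acc}: expected 12 monthly values")
            row.extend(values)
        rows.append(row)
    return header, rows


if __name__ == "__main__":
    toy = {
        "tmax": {"A1": list(range(12)), "A2": list(range(12, 24))},
        "prec": {"A1": [1] * 12, "A2": [2] * 12},
    }
    header, rows = concat_variables(toy)
    print(len(header), len(rows))
```

With 26 variables and 12 months each, this yields 312 trait columns plus the ID column, one row per accession.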

It's difficult to predict the impact of the correlation between variables. It may introduce a bias in the computed Gower distances. Would it make sense to apply delta encoding for the observed values of the same variable at different time points? E.g. if for some accession the observed max temperatures over time were 29, 27, 32, 21, 12, 5, ... a delta encoding would yield 29, -2, 5, -11, -9, -7, ... This would then favor the selection of core accessions with maximally different evolution over time of each variable, rather than maximizing absolute differences at each individual time point.
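The delta encoding described above can be sketched in a few lines. The function names are hypothetical; the first value is kept as-is and every later value is replaced by its difference from the previous one, which reproduces the max-temperature example.

```python
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store each change from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum restores the original series."""
    return list(accumulate(deltas))


if __name__ == "__main__":
    temps = [29, 27, 32, 21, 12, 5]
    encoded = delta_encode(temps)
    print(encoded)                # [29, -2, 5, -11, -9, -7]
    print(delta_decode(encoded))  # [29, 27, 32, 21, 12, 5]
```

The encoding would be applied per accession and per variable, that is, separately to each block of 12 monthly columns, before the columns are combined into one file.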

Let me know if you need more help to get started. Are you using the R package?

anandksrao commented 10 months ago

Hi @hermandebeukelaer, thank you for your kind offer of help. I am attaching an example data subset with 50 lines (49 with data and 1 header line) as a tab-separated text file with a .txt extension. It has 112 columns, each with a unique column name.

% wc -l Concat_test_file_2023Nov13_50lines.txt  
      50 Concat_test_file_2023Nov13_50lines.txt

% head -n1 Concat_test_file_2023Nov13_50lines.txt | grep -o "\t" | wc -l 
     111

Columns of interest containing categorical data = SPECIES (col 6), ORIGCTY (col 9)
Columns of interest used to extract WorldClim data = DECLATITUDE (col 10), DECLONGITUDE (col 11)
Columns with extracted numerical data = AMT (col 13) through wind.Dec (col 112)

I have not yet started parsing this file with Core Hunter. Do you recommend that I use R in the terminal? I am open to any and all advice / suggestions on how to proceed to create "core" and "mini core" subsets. Thanks in advance.

Concat_test_file_2023Nov13_50lines.txt

hdbeukel commented 10 months ago

I recommend using RStudio. Install and load the corehunter package and then you should be able to load your data as phenotypic traits, after some minor modifications to your data file. See the examples on the Core Hunter website. The first column needs to contain the unique accession ID. I also recommend including a second header row TYPE which specifies the data type for each column: categorical (N = nominal, assuming there is no order) or numerical (O = ordinal).

Can you give it a try and modify your example data file to match these requirements?

hdbeukel commented 10 months ago

Oh and any columns you don't want to use for core selection should be removed from the file.
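The modifications listed above (ID in the first column, a TYPE row, unused columns dropped) could be scripted roughly as follows. This is a hedged sketch, not the official procedure: the helper name `to_corehunter_table`, the `ID` header label and the example column name `ACC` are assumptions, and the type codes simply follow the description in this thread (N for categorical, O for numerical). The exact layout and codes should be checked against the examples on the Core Hunter website.

```python
import csv
import io


def to_corehunter_table(header, rows, id_col, keep, categorical, numeric_code="O"):
    """Build a tab-separated table with ID first and a TYPE row.

    header/rows: the original table as lists of strings.
    id_col: name of the column holding the unique accession ID.
    keep: names of the columns to use for core selection (others are dropped).
    categorical: set of kept column names to mark as nominal ("N").
    numeric_code: code for all other kept columns (assumption: "O", as in the thread).
    """
    index = {name: i for i, name in enumerate(header)}

    out_header = ["ID"] + list(keep)  # "ID" label is an assumption
    type_row = ["TYPE"] + ["N" if c in categorical else numeric_code for c in keep]
    out_rows = [[r[index[id_col]]] + [r[index[c]] for c in keep] for r in rows]

    # The first column must contain unique accession IDs.
    ids = [r[0] for r in out_rows]
    if len(set(ids)) != len(ids):
        raise ValueError("accession IDs are not unique")

    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(out_header)
    writer.writerow(type_row)
    writer.writerows(out_rows)
    return buf.getvalue()


if __name__ == "__main__":
    header = ["ACC", "SPECIES", "DECLATITUDE", "AMT"]
    rows = [["a1", "sp1", "1.0", "20.5"], ["a2", "sp2", "2.0", "18.0"]]
    print(to_corehunter_table(header, rows, "ACC", ["SPECIES", "AMT"], {"SPECIES"}))
```

Passing the kept columns explicitly makes the dropped ones (here the latitude column) visible in the call, rather than deleting them by hand in the file.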

anandksrao commented 10 months ago

Thanks a lot for your replies, @hdbeukel.

Since my previous post, I found this page by an author of EvaluateCore: https://aravind-j.github.io/EvaluateCore/articles/additional/Example%20Core%20Data.html#setup-the-environment. I could adapt the R code for CoreHunter given there for my purposes because it was quite detailed. None of the other links I could find online were this detailed, AFAIK.

My run in the R terminal (not RStudio) on macOS ended with the following lines visible on-screen. Could you please confirm whether this is the message shown when the run completes successfully? Thanks in advance.

Search : ParallelTempering stopped after 178.0 seconds and 835 steps
Best solution with evaluation : 0.577745
Best solution with evaluation : Subset solution: {88, 345, 1014, 756, 14, 1316, 565, 757, 604, 580, 953, 4, 1275, 564, 1029, 349, 955, 964, 645, 1046, 1133, 909, 1252, 43, 636, 50, 614, 1004, 219, 1273, 1021, 778, 967, 637, 15, 46, 1093, 1187, 1281, 49, 1199, 16, 82, 806, 695, 624, 934, 575, 873, 437, 89, 766, 1198, 447, 666, 541, 542, 1278, 1332, 1106, 884, 121, 981, 363, 1087, 1322, 399, 699, 528, 718, 101, 1054, 427, 615, 1290, 672, 347, 1271, 1186, 1185, 1151, 236, 714, 724, 1099, 796, 627, 1143, 305, 1080, 61, 508, 831, 617, 904, 229, 633, 390, 1103, 1105, 1164, 941, 1024, 407, 1190, 619, 1139, 145, 1194, 167, 416, 2, 1173, 357, 1219, 33, 12, 642, 21, 430, 477, 1037, 682, 164, 204, 198, 914, 1225, 813, 270, 373, 262, 1184, 124}

hdbeukel commented 10 months ago

Hi @anandksrao,

This seems like a valid output of a successful Core Hunter run.
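If the selected subset is only available as console text like the lines above, the numbers can be pulled out of the "Subset solution" line with a small script. This is only a sketch: the function name `parse_subset` is hypothetical, and whether the numbers are row indices or accession IDs is not stated in this thread, so they are treated here as plain integers.

```python
import re


def parse_subset(line):
    """Extract the integers listed inside 'Subset solution: {...}'."""
    match = re.search(r"Subset solution:\s*\{([^}]*)\}", line)
    if match is None:
        raise ValueError("no 'Subset solution: {...}' found in line")
    return [int(token) for token in match.group(1).split(",") if token.strip()]


if __name__ == "__main__":
    sample = "Best solution with evaluation : Subset solution: {88, 345, 1014}"
    selected = parse_subset(sample)
    print(len(selected), selected)  # 3 [88, 345, 1014]
```

Counting the parsed entries is a quick check that the subset has the requested core size.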