Hippocampome-Org / php

Hippocampome web portal

GAA: [neurites] There is a discrepancy for the number of contacts between the matrix and the tool #518

Closed drdiek closed 4 years ago

drdiek commented 4 years ago

From Giorgio: "Just noticed this during my call with Nate: for DG GC to DG AAC, number of contacts is 2.33 according to the table, but 2.42 according to the tool using the ‘chosen’ parameters (1.09, 6.2, 2). How do we reconcile?"

drdiek commented 4 years ago

@nmsutton I have recomputed the number of contacts given the raw values provided on the Evidence page for DG GC to DG AAC, and I ended up with a value of 2.33, the same as Carolina's value. This leads me to think that there must be a flaw somewhere in the php code. I will investigate further and let you know what I find.

drdiek commented 4 years ago

@nmsutton I am having difficulty printing out intermediate values of variables from connprob.php. I guess I do not know enough about what I am doing. Here is a spreadsheet I created wherein I computed the number of contacts for DG GC to DG AAC from scratch. Perhaps you can use this spreadsheet to check the performance of the php code. number_of_contacts_glitch.xlsx
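For anyone checking the spreadsheet against the code, here is a minimal JavaScript sketch of the per-parcel number-of-contacts formula as it is quoted later in this thread (`noc = (4 * c * length_axons * length_dendrites) / (volume_axons + volume_dendrites)`); the helper name and the guard for empty parcels are my own, not connprob.php's:

```javascript
// Sketch of the per-parcel number-of-contacts (NoC) formula quoted later
// in this thread. Helper name and the empty-parcel guard are hypothetical.
function parcelNoc(c, lengthAxons, lengthDendrites, volumeAxons, volumeDendrites) {
  const denom = volumeAxons + volumeDendrites;
  if (denom === 0) return 0; // avoid 0 / 0 for parcels with no data
  return (4 * c * lengthAxons * lengthDendrites) / denom;
}

// DG:SMi values reported further down in this thread:
const noc = parcelNoc(4.958615217267108, 562.6843516, 514.5695364,
                      2057813.544, 42192938.71);
console.log(noc.toFixed(7)); // prints 0.1297803
```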

nmsutton commented 4 years ago

@drdiek Thanks for the work on this. The parameters the tool uses here are:

length_axons = 3329.748518
length_dendrites = 1935.557247
volume_axons = 85681348.19
volume_dendrites = 4177247.917

According to your spreadsheet, the values used were:

length_axons in DG:H (μm) = 3329.74666666667
length_dendrites in DG:H (μm) = 1935.56
volume_axons for A:DG:H (μm³) = 87757340.7366667
volume_dendrites for D:DG:H (μm³) = 8620488.35

I checked DG-Table-2.csv, and the tool is accurately reading the values for the neurons, even though they are different from the ones you used for volume. Where are you getting your volume values from?

nmsutton commented 4 years ago

@drdiek I have now corrected the github files based on the emails we exchanged. I have also uploaded the files to phpdev. Please check and confirm this issue is now completed.

drdiek commented 4 years ago

@nmsutton It looks like this has been resolved. On a side note, I notice that only 3 decimal places are used for the probability display, whereas 5 decimal places are displayed in the synaptic probabilities matrix. Should I create a new Issue for this?

nmsutton commented 4 years ago

@drdiek Sounds good; you can create an issue if you want, but the reason the probability has 3 decimal places on the tool page is that Giorgio asked for it to have 4 significant digits. We may need to send an email asking whether the tool display should match the main matrix, unless we just want to make it match without an email about that.

drdiek commented 4 years ago

From Giorgio: "However, the tool calculation still mismatch with the tables when choosing standard parameters (1.09, 6.2, 2). For example, semilunar granule to semilunar granule has a probability of 0.01459 in the table and 0.0002790 in the tool."

This is perplexing. I just did the raw calculations for DG Granule to DG Axo-axonic, and the values for the connection probability and the number of contacts match between the tool and the matrix. This suggests to me that the equations in connprob.php are working as expected. The problem therefore lies either in the value for the parcel volume for DG:SMi or in the values for the standard parameters. For the latter, I have an email inquiry out to Carolina to see if she used different values or not.

drdiek commented 4 years ago

@nmsutton I just talked with Carolina, and she confirms that she only used the standard parameters that Giorgio has mentioned. This means that the error lies somewhere else, possibly with the parcel volume values. I will keep investigating.

drdiek commented 4 years ago

@nmsutton OK, I just now clicked on the Synaptic Probabilities matrix for the value of 0.01459 for DG Semilunar Granule to DG Semilunar Granule. This took me to the intermediate table shown below. Notice how the parcel-specific value for DG:SMi is 0.0002800, which is very close to the value of 0.0002790 produced by the tool. However, take a look at the value in the total box. It matches the value from the matrix, but does not correspond to the parcel-specific value at all. Something very, very odd is happening to the summation across parcels here.

(Screenshot attached: Screen Shot 2020-04-10 at 12 51 37)

nmsutton commented 4 years ago

@drdiek Some follow-up here: the values are calculated as follows.

SMi:

```sql
SELECT source_ID, source_Name, target_ID, target_Name, neurite,
       CAST(AVG(CAST(probability AS DECIMAL(10,5))) AS DECIMAL(10,5))
FROM number_of_contacts
WHERE source_ID=1001 AND target_ID=1001
  AND neurite='DG:SMi:Both' AND probability!=''
GROUP BY source_ID, source_Name, target_ID, target_Name, neurite
LIMIT 500000;
```

Total:

```sql
SELECT source_ID, source_Name, target_ID, target_Name, neurite,
       CAST(AVG(CAST(probability AS DECIMAL(10,5))) AS DECIMAL(10,5))
FROM number_of_contacts
WHERE source_ID=1001 AND target_ID=1001
  AND neurite='DG:All:Both' AND probability!=''
GROUP BY source_ID, source_Name, target_ID, target_Name, neurite
LIMIT 500000;
```

I will look into why there is a difference.

drdiek commented 4 years ago

@nmsutton It turns out there is an error in the underlying spreadsheet from Carolina. She accidentally summed across more cells in her spreadsheet than she was supposed to, and this error propagated to the database, which is where the total values are derived from.

nmsutton commented 4 years ago

@drdiek Thanks! Do you think this is an isolated incident, or could there be more like it?

nmsutton commented 4 years ago

@drdiek Please tell me: how can I find the "standard parameters" for each neuron pair for testing the tool?

drdiek commented 4 years ago

@nmsutton There is the possibility that there are more incidents such as this. I do not know of an efficient way of checking her work. We would essentially have to try to replicate all of the values in columns L and M in number_of_contacts.csv and then run a comparison check across all values.

The standard parameters are 1.09, 6.2, and 2. She used those values for calculating all combinations of neuron types.

nmsutton commented 4 years ago

@drdiek thanks. I have now added those values as default ones in the tool that users can change. Also, Giorgio wrote:

I strongly recommend that you and Diek run a script to iteratively calculate all values of contacts and probabilities for all existing connections in the db using the tool and standard parameters and automatically check for any discrepancy with the corresponding tables...

I will work on that.

drdiek commented 4 years ago

@nmsutton I have uploaded the necessary changes to number_of_contacts.csv and potential_synapses.csv. A new ingest should fix the current error that Giorgio flagged.

nmsutton commented 4 years ago

@drdiek An update on this: I have written modifications to a test version of the tool to export all connections as text. I have attached a zip (ToolDataSorted.zip) that contains a csv file of the export and a script that imports a csv file into the db for temporary use (since this will not be part of the main csv2db collection). If you want to use the script, you need to change the db_name and user_name in the import script. The script is run (on Linux, and maybe macOS as well) as: $ ./import_csv_file_csv2db.sh ToolDataSorted If you cannot get it to import, you could just import the csv file into the db any other way for testing.

Unfortunately, the JavaScript appears to get overloaded by my modifications and fails with "ERR_INSUFFICIENT_RESOURCES" messages. Because of that, about half of the connection entries never get exported, but this should be enough to get started. I will look into how to correct that. The next step I will work on is finding a way in MySQL to compare the new ToolDataSorted table to the number_of_contacts data to look for differences. I would be interested in any ideas you have about how to compare the tables' data, in MySQL or otherwise. Do you know of a way to do that, or would you be willing to look into it?
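One possible approach outside MySQL (a sketch only; the record field names are hypothetical placeholders for whatever the csv columns turn out to be): key each export by its source/target pair and report pairs present in one set but not the other.

```javascript
// Sketch: diff two connection exports by (source_ID, target_ID) pair.
// Field names here are hypothetical placeholders, not the real schema.
function diffByPair(tableA, tableB) {
  const key = r => `${r.source_ID}->${r.target_ID}`;
  const aKeys = new Set(tableA.map(key));
  const bKeys = new Set(tableB.map(key));
  return {
    onlyInA: tableA.filter(r => !bKeys.has(key(r))).map(key),
    onlyInB: tableB.filter(r => !aKeys.has(key(r))).map(key),
  };
}

const carolina = [{ source_ID: 1001, target_ID: 1001 },
                  { source_ID: 1001, target_ID: 1004 }];
const tool = [{ source_ID: 1001, target_ID: 1001 }];
console.log(diffByPair(carolina, tool)); // one pair missing from the tool export
```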

drdiek commented 4 years ago

@nmsutton I am going to try to make the comparison via Excel. I will let you know how that proceeds.

nmsutton commented 4 years ago

@drdiek How did it go when you tried working on this? I am almost ready to work on this again; please describe where you are with it, and I will try to help continue from there if you are not finished.

drdiek commented 4 years ago

@nmsutton I ended up just trying to compare the summed totals, because they required the least manipulation to perform a direct comparison. After a first pass, I found that your file had many more entries than Carolina's, which I found odd. The number of entries for DG aligned well, however. On the down side, a majority of the values differed by quite a bit, so I have no idea what is going on.

nmsutton commented 4 years ago

@drdiek conndata.csv has 3120 entries, and my file has only 1583 because the JavaScript apparently got overloaded and didn't output everything. Please explain more what you mean by "your file had many more entries than Carolina's, which I found odd." Do you mean there are neuron1-neuron2 rows in my file that are not in conndata.csv? Which types of values showed a lot of differences?

nmsutton commented 4 years ago

@drdiek In looking into this, I now realize I mixed up the column labels: [layer]_prob and [layer]_noc should have been switched. I have attached (ToolDataSortedVer2.zip) the corrected version of the csv file. Sorry about that. Hopefully this explains many of the differences you were observing.

drdiek commented 4 years ago

@nmsutton Based on the magnitudes of the values, I would say that you got it right the first time and swapped them the second time. Probabilities are much smaller than 1.0, and numbers of contacts are on the order of 1.0.

I made a mistake. You are correct in that your data has fewer records than Carolina's data.

drdiek commented 4 years ago

@nmsutton Here is my comparison spreadsheet (Comparisons.xlsx). I am only comparing total sums. Carolina's values are in white, and yours are in yellow. Column J compares the Source and Target ID's. Column M compares the numbers of contacts, and if the difference between the two sets is <0.01, then the value is set to zero. Column P compares the probabilities, and if the difference between the two sets is <0.0001, then the value is set to zero. Column Q confirms whether Column J contains a "Y," Column M contains a zero, and Column P contains a zero. As you can see from Column Q, over 3/4 of the values do not completely match for all of the DG total-sums comparisons.
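For anyone reproducing the spreadsheet check, the matching rule described above can be sketched in code as follows (the record shape and helper name are hypothetical; the thresholds are the ones from the spreadsheet):

```javascript
// Sketch of the spreadsheet's matching rule: two total-sum records agree
// when source/target IDs match, NoC totals differ by < 0.01, and
// probability totals differ by < 0.0001. Record fields are hypothetical.
function recordsMatch(carolina, tool) {
  const idsMatch = carolina.sourceId === tool.sourceId &&
                   carolina.targetId === tool.targetId;
  const nocMatch = Math.abs(carolina.noc - tool.noc) < 0.01;
  const probMatch = Math.abs(carolina.prob - tool.prob) < 0.0001;
  return idsMatch && nocMatch && probMatch;
}

const a = { sourceId: 1001, targetId: 1001, noc: 2.33, prob: 0.01459 };
const b = { sourceId: 1001, targetId: 1001, noc: 2.42, prob: 0.01460 };
console.log(recordsMatch(a, b)); // false: the NoC totals differ by 0.09
```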

drdiek commented 4 years ago

@nmsutton I changed my delta criteria: instead of a fixed-value comparison, it is now a relative-value comparison within 10% of Carolina's value (Comparisons.xlsx). This has raised the fraction of matching values from less than 1/4 to closer to 1/3.

nmsutton commented 4 years ago

@drdiek Yes, you are correct: I got it right the first time with the [layer]_prob values. Please disregard ToolDataSortedVer2.zip. Sorry for the confusion; I will continue to investigate this issue.

nmsutton commented 4 years ago

@drdiek I did some further looking into the connections by importing conndata.csv into a db and checking the number of connections within the same subregion with this:

```sql
SELECT * FROM conndata2 AS c1
WHERE SUBSTRING_INDEX(c1.Source_Name,' ',1) = SUBSTRING_INDEX(c1.Target_Name,' ',1);
```

Result: 1989 rows

I also worked with the tool code more and found that outputting one subregion at a time allows a bit more output. I have attached (tool_data_sorted_3.txt; rename it to .csv) a file with 1797 connections reported. As it turns out, only 192 connections (out of 1989) are missing from the output I have so far. We have 90.35% coverage of the connections in the csv file, and I may just manually collect those last 192 connections using the tool rather than working through the resource issue in JavaScript.

Might it be better to wait for an update on the source data first, since you were looking into the files that create Potential Synapses 4.0.xlsx? If that source data is corrected, then validation of all the tool results, including those last 192 connections, may be more valuable at that point. I could work on other tasks in the meantime. Does that seem like a good plan? Does it seem that the source data in DG-Table-1.csv and DG-Table-2.csv, etc., may be the root issue?

drdiek commented 4 years ago

@nmsutton Thank you for the hard and creative work you have been putting into this Issue. Yes, please move on to a different Issue while I continue to check on the original data behind -Table-1.csv and -Table-2.csv. I currently trust the computations coming from the tool more than I trust the transcription of data from Carolina's original spreadsheets.

nmsutton commented 4 years ago

@drdiek You're welcome, and thank you also for your great and effective work on this. Before I start on a different task, I did some follow-up and found the following:

```sql
SELECT * FROM conndata2 AS c1
WHERE SUBSTRING_INDEX(c1.Source_Name,' ',1) = SUBSTRING_INDEX(c1.Target_Name,' ',1)
   OR (SUBSTRING_INDEX(c1.Source_Name,' ',1) LIKE "%EC%" AND SUBSTRING_INDEX(c1.Target_Name,' ',1) LIKE "%EC%")
   OR (SUBSTRING_INDEX(c1.Source_Name,' ',1) LIKE "%CA3%" AND SUBSTRING_INDEX(c1.Target_Name,' ',1) LIKE "%CA3%");
```

Result: 2381 rows. This corrects for subregion name differences, so 2381 is really the total number of connections targeted.
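A JavaScript mirror of that query's subregion-matching logic, in case it is useful for checking the exports client-side (the helper name is mine; the EC/CA3 substring handling follows the LIKE clauses, and the CA1 name in the example is a hypothetical placeholder):

```javascript
// Sketch mirroring the SUBSTRING_INDEX logic: two neuron names share a
// subregion when their first space-delimited token matches, with EC and
// CA3 matched by substring to absorb naming variants.
function sameSubregion(sourceName, targetName) {
  const prefix = n => n.split(' ')[0];
  const s = prefix(sourceName), t = prefix(targetName);
  if (s === t) return true;
  if (s.includes('EC') && t.includes('EC')) return true;
  if (s.includes('CA3') && t.includes('CA3')) return true;
  return false;
}

console.log(sameSubregion('DG Semilunar Granule', 'DG Total Molecular Layer')); // true
console.log(sameSubregion('DG Granule', 'CA1 Pyramidal')); // false (hypothetical names)
```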

I did some MySQL and JavaScript work and came up with a way to automate the output of the last remaining connections. I have attached (tool_data_sorted_4.txt) a new csv version with 1993 connections. This is a first pass, and the method can be rerun in additional passes to get the final connections (it also reports which connections are missing). It will work when we have the updated tool results from the revised source data. I am therefore confident that I will be able to produce a dataset for validation with 100% of the connection results from the tool once we have the revised data.

Please tell me if you can use help correcting the -Table-1.csv and -Table-2.csv data, because I consider my work on the tool results output complete until we get that revised data.

drdiek commented 4 years ago

@nmsutton OK, thank you for the update. I did manage to find a serious error in one of Carolina's original spreadsheets, but it was a systemic error that would affect both the matrices and the tool. I will let you know when I could use some help.

drdiek commented 4 years ago

@nmsutton OK, I think I found a glitch in connprob.php, which may be making the tool calculate incorrectly. I think lines 146-149 actually belong just after line 139. That is, you need to check on the values first and assign potential zeros before counting up the number of non-zero NoC values. What do you think?

nmsutton commented 4 years ago

@drdiek I doubt that is the issue, unfortunately, because lines 146-149 control line 156's "(1 / noc_non_zero)", which distributes the fraction of the +1 given to each noc that has a non-zero value. If it were an issue, it could be recognized by results with 1, 2, or 3 non-zero noc values having the wrong proportion of extra 1/1, 1/2, or 1/3 applied to them. Testing whether +1, +0.5, and +0.333 are added correctly can show if it is working right. "let noc =" is recomputed after the NaN check, so that should be adjusted for.

A question I have is that since we decided just recently how to distribute the +1 with noc, was that also included in the original matrices' values? Could our recent decision about how to distribute that be causing differences in the values between the main matrix and what the tool computes?

Even though I am unsure those lines cause an issue I will test it including moving the NaN check to the earlier line to try to confirm the results.
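To make that proportion check concrete, here is a small sketch of the distribution rule described above (the helper name is mine; the tool's actual implementation lives in connprob.php):

```javascript
// Sketch: distribute the assumed +1 contact evenly across parcels with a
// non-zero NoC, i.e. +1, +0.5, or +0.333 each when 1, 2, or 3 parcels
// are non-zero. Helper name is hypothetical.
function distributePlusOne(nocValues) {
  const nonZero = nocValues.filter(v => v !== 0).length;
  if (nonZero === 0) return nocValues.slice();
  return nocValues.map(v => (v !== 0 ? v + 1 / nonZero : v));
}

// Two non-zero parcels, so each gains +0.5:
console.log(distributePlusOne([0, 0.13, 0, 4.52]));
```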

drdiek commented 4 years ago

@nmsutton We are going to have to agree to disagree on this point. Line 142 tallies up the non-zero values that come from the first calculation of noc in line 140. Without the checks in place, the blank values that are read in for the lengths will not be accommodated appropriately in the equation in line 140. I am zeroing in on this because I have done another hand calculation and have come up with the same value for NoC (1.62) as Carolina. The tool comes up with a completely different value (0.463). The value from the tool does not even make any logical sense, because there are 2 non-zero parcels involved, so the NoC value should be greater than 0.5.

nmsutton commented 4 years ago

@drdiek Well, I haven't checked it yet, so you could certainly be right; I was guessing more than actually testing. What neuron pair are you referring to that had the 0.463 value? I will test things out now.

drdiek commented 4 years ago

@nmsutton DG Semilunar Granule to DG Total Molecular Layer.

nmsutton commented 4 years ago

@drdiek Well, I tried moving the NaN check to line 139, but the same 0.463 still appears. I will check further why it may not be recognizing the non-zero values or adding the +1 fractions the right way. Results with the change:

source_id | source               | target_id | target
1001      | DG Semilunar Granule | 1004      | DG Total Molecular Layer

Smo_prob | Smi_prob  | SG_prob | H_prob   | Total_prob
0        | 0.0006514 | 0       | 0.003575 | 0.004224

Smo_noc | Smi_noc | SG_noc | H_noc | Total_noc
0       | 0.463   | 0      | 4.85  | 5.31

nmsutton commented 4 years ago

@drdiek I have found something through debugging and found a way to fix it, even though I am not sure why it was occurring. I have now uploaded a new version of the tool to phpdev; please test it and see how it looks. I added if (isNaN(noc)){noc=0;} (as well as adding the NaN check after line 139), and now it appears to better recognize the non-zero entries. I will shortly share the debug output that led to this change. The output for Smi_noc is now 0.630, which makes sense because it is +0.5 from the original 0.13 value. I'm not sure how Carolina's data arrives at 1.62.

nmsutton commented 4 years ago

@drdiek It is strange that in the formula `let noc = (4 * c * length_axons[i] * length_dendrites[i]) / (volume_axons[i] + volume_dendrites[i]);`, for SMo this evaluates to `noc = (4 * 4.95 * 0 * 520.52) / (0 + 0)` and yet somehow noc = NaN. I would not have expected that, but it can be corrected by testing for isNaN(noc). I think the tool thought there were 3 non-zero noc values and assigned +0.333 to SMi, which created 0.463.

Given that SMi is computed by (4 * 4.96 * 562.7 * 514.6) / (2057813.5 + 42192938.7) = 0.1298278, and the +1 fraction (+0.5) brings this to 0.63, how is this calculation different from the one by hand where you got 1.62?

Debug output without isNaN(noc):

SMo: noc: NaN | c: 4.958615217267108 | length_axons[i]: 0 | length_dendrites[i]: 520.5298013 | volume_axons[i]: 0 | volume_dendrites[i]: 0
SMi: noc: 0.12978026336163775 | c: 4.958615217267108 | length_axons[i]: 562.6843516 | length_dendrites[i]: 514.5695364 | volume_axons[i]: 2057813.544 | volume_dendrites[i]: 42192938.71
SG: noc: 0 | length_axons[i]: 954.060894 | length_dendrites[i]: 0 | volume_axons[i]: 44950895.99 | volume_dendrites[i]: 44011041.24
H: noc: 4.520044352566802 | length_axons[i]: 4898.067483 | length_dendrites[i]: 2039.238411 | volume_axons[i]: 19947430.46 | volume_dendrites[i]: 23882468.66
DG Semilunar Granule, DG Total Molecular Layer, noc: 0.000, 0.0006514, 0.000, 0.003575, 0.004224, prob: 0.00, 0.463, 0.00, 4.85, 5.31

Debug output with isNaN(noc):

SMo: noc: 0 | length_axons[i]: 0 | length_dendrites[i]: 520.5298013 | volume_axons[i]: 0 | volume_dendrites[i]: 0
SMi: noc: 0.12978026336163775 | length_axons[i]: 562.6843516 | length_dendrites[i]: 514.5695364 | volume_axons[i]: 2057813.544 | volume_dendrites[i]: 42192938.71
SG: noc: 0 | length_axons[i]: 954.060894 | length_dendrites[i]: 0 | volume_axons[i]: 44950895.99 | volume_dendrites[i]: 44011041.24
H: noc: 4.520044352566802 | length_axons[i]: 4898.067483 | length_dendrites[i]: 2039.238411 | volume_axons[i]: 19947430.46 | volume_dendrites[i]: 23882468.66
DG Semilunar Granule, DG Total Molecular Layer, noc: 0.000, 0.0004790, 0.000, 0.003456, 0.003933, prob: 0.00, 0.630, 0.00, 5.02, 5.65
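The root of the SMo oddity is that in JavaScript `0 / 0` evaluates to NaN, not 0, so a parcel with zero axon length and zero volumes poisons the formula's result. A minimal illustration of the failure and the guard:

```javascript
// Why SMo produced NaN: with zero axon length AND zero volumes the
// formula becomes 0 / 0, which is NaN in JavaScript, not 0.
const c = 4.958615217267108;
let noc = (4 * c * 0 * 520.5298013) / (0 + 0); // numerator 0, denominator 0
console.log(Number.isNaN(noc)); // true
console.log(noc === noc);       // false: NaN never equals itself

// The guard added to the tool:
if (isNaN(noc)) { noc = 0; }
console.log(noc); // 0
```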

nmsutton commented 4 years ago

@drdiek In case it helps with testing, attached (tool_data_sorted_5_DG_ver2.txt) is a csv file of most DG connections with the new isNaN(noc) update. I have found that I should make separate csv files for each subregion, because subregions have different layer counts and their values should be in different columns than those of DG; that is why I have only included a DG csv here.

Thanks for investigating the NaNs, which has led to the odd, but at least helpful, isNaN(noc) update. Moving the other isNaN code to line 139, as you recommended, could help with avoiding NaN issues too, as I have now seen that there can be NaN issues with the noc_non_zero calculations. I appreciate your help with that.

The Total_noc values are now all greater than 1, which makes sense because the +1 assumed connection should make the values at least that high. In fact, the values are generally closer to the ones Carolina reported, based on the comparisons I have checked so far. I ran the new numbers through your comparison spreadsheet, and there is now a 45% match, which is an improvement. I have attached (Comparisons_update041420.xlsx) the spreadsheet with the updated values.

drdiek commented 4 years ago

@nmsutton OK, try the calculations again. I found out that the volume values in DG-Table-2.csv had been shifted half-way down the file. Giorgio had a feeling that there were shifted values somewhere.

nmsutton commented 4 years ago

@drdiek I am pleased to say that this has resulted in a 99.12% match (226/228). I have attached (Comparisons_update041520.xlsx) the comparisons file and new tool output (tool_data_sorted_6.txt). Can the same thing be done for the other Table-2.csv files?

drdiek commented 4 years ago

@nmsutton I am in the process of double-checking the other Table-2.csv files. So far, I have found systemic inconsistencies in the CA1-Table-1.csv and CA1-Table-2.csv files, along with the neurites_quantified spreadsheet, which Carolina is now looking into updating. I have yet to check the EC values, but that is next on my todo list.

drdiek commented 4 years ago

@nmsutton EC all checks out.

I am now looking into the 2 mismatches you found in DG. The first, in row 97 of Comparisons_update041520.xlsx, is due to some sort of mis-copying and pasting of the tool value. When I run the tool, I get a total probability value of 0.0003716, and the matrix has a value of 0.00035976, so they are essentially identical. I will let you know what I find with the second mismatch.

drdiek commented 4 years ago

@nmsutton The second mismatch is between DG HIPROM and DG MOLAX. My hand calculations agree with your total summed values in Comparisons_update041520.xlsx, but the numbers produced by the tool are all over the place. For example, the NoC for SMo is 48.0 according to the tool output, which is bonkers, and 1.26 according to my hand calculations.

nmsutton commented 4 years ago

@drdiek I'm not sure where you are getting the 48.0 value. Did you use formulas in the tool to try to recreate that? Below are the debug output, which matches the tool output on phpdev, and the row for the pair in the comparisons xlsx file. The debug output shows a 1.26 calculation for SMo.

SMo: noc: 0.9225774300153898 | c: 4.958615217267108 | length_axons[i]: 8905.293333 | length_dendrites[i]: 850.165157 | volume_axons[i]: 114353949.3 | volume_dendrites[i]: 48414080.13
SMi: noc: 0.998354721423595 | c: 4.958615217267108 | length_axons[i]: 5315.619561 | length_dendrites[i]: 1044.66251 | volume_axons[i]: 72963555.23 | volume_dendrites[i]: 37359282.52
SG: noc: 0 | c: 4.958615217267108 | length_axons[i]: 4166.23672 | length_dendrites[i]: 0 | volume_axons[i]: 99764667.88 | volume_dendrites[i]: 0
H: noc: 2.8821981348962744 | c: 4.958615217267108 | length_axons[i]: 10018.07185 | length_dendrites[i]: 1732.166525 | volume_axons[i]: 75183873.63 | volume_dendrites[i]: 44234279.67
DG HIPROM, DG MOLAX, prob: 0.003140, 0.004345, 0.000, 0.009373, 0.01677, noc: 1.26, 1.33, 0.00, 3.22, 5.81

nmsutton commented 4 years ago

@drdiek In addition to the debug output just posted for the neuron pair, attached (connprob_debug.txt; rename to .php) is a version of the tool that creates debug output for any tool settings entered, in case you want that. I can keep providing debug output too if wanted; it is no problem for me.

drdiek commented 4 years ago

@nmsutton Not sure where the overly large number came from, but it is gone now. The last divergence, for DG HIPROM to DG MOLAX, may come from Carolina's side of things in DG:H.

There is a total revamp of CA1 going on. Carolina made some transcription errors when she produced neurites_quantified, which she has now rectified, but I am afraid the subsequent values in number_of_contacts and potential_synapses may need revamping, too.

nmsutton commented 4 years ago

@drdiek OK, thanks for the update! Tell me if I can provide further help at any point. When I talked to Giorgio at my meeting, we discussed that this work with the tool (and, it sounds like, the matrices' values update as well) seems to be the last thing needed before the article's submission.

drdiek commented 4 years ago

@nmsutton I was going to suggest that we do a new import with the spreadsheets I just updated, but I notice that the probability values for DG have been changed quite a bit. I had only asked Carolina to update CA1, so I have an inquiry out to her asking about DG. When I hear back, I will let you know.

The values for the numbers of contacts for DG remain good except for the value for DG HIPROM to DG MOLAX. Carolina will be looking into that today.

drdiek commented 4 years ago

@nmsutton Just going to give you an update. Carolina updated her value for the number of contacts between DG HIPROM and DG MOLAX, so now all of the values align between the tool and the matrix for the number of contacts for DG.

I just found out that Carolina adjusted the parcel volume for DG:H, which thereby affects the calculations for the number of potential synapses. I have updated DG-Table-1.csv and DG-Table-2.csv accordingly, so you can calculate new tool values. Here is a new comparison spreadsheet, ready for your newly recalculated values to be inserted into the yellow columns: Comparisons_update041720.xlsx

I almost forgot to mention that you also need to perform a new ingest to get the new NxN spreadsheet values into the database, now that I know where all of the changes have originated from.