Hippocampome-Org / php

Hippocampome web portal

GAA: [neurites] There is a discrepancy for the number of contacts between the matrix and the tool #518

Closed drdiek closed 4 years ago

drdiek commented 4 years ago

From Giorgio: "Just noticed this during my call with Nate: for DG GC to DG AAC, number of contacts is 2.33 according to the table, but 2.42 according to the tool using the ‘chosen’ parameters (1.09, 6.2, 2). How do we reconcile?"

nmsutton commented 4 years ago

@drdiek sounds like you and she are making good progress. Well, the good news is that the DG HIPROM to DG MOLAX value matches now. There is one remaining entry:

  1007 DG Neurogliaform (-)3000p I 1004 DG Total Molecular Layer (-)3303 1007 DG Neurogliaform 1004 DG Total Molecular Layer Y | 3.75288506375008 | 3.75 | 0 | 0.000359759710525 | 0.00025222886

That matches on NOC but not Prob. I will look into the debug code, but I wonder whether this could be a misreported Prob value in the source data, because none of the other Prob values failed to match when the NOC matched. I will report back what I find. Here is the updated xlsx file: Comparisons_update041720_ver2.xlsx.

nmsutton commented 4 years ago

@drdiek I looked at the calculation and found:

  prob = (c * ((length_axons[i] * length_dendrites[i]) / volumes_array[i])) / num_contacts[i];

With data from the tool:

  0.000252 = (4.959 * ((3487.72846 * 520.5298013) / 9518233055)) / 3.75

This seems to agree with the tool value used in the comparisons xlsx file, so I don't know why Carolina's work got a Prob of 0.000359759710525 for that. I think it may be an issue in the source data.
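The expression above can be checked by hand with a few lines of Python. This is only a sketch of the quoted formula, not the tool's actual PHP code; the function and parameter names are assumptions chosen to mirror the quoted variable names.

```python
def connection_probability(c, axon_length, dendrite_length, volume, num_contacts):
    """Evaluate the quoted expression for a single parcel.

    Mirrors: prob = (c * ((length_axons * length_dendrites) / volume)) / num_contacts
    """
    numerator = c * ((axon_length * dendrite_length) / volume)
    return numerator / num_contacts


# Values quoted in the comment above (DG Neurogliaform to DG Total Molecular Layer).
prob = connection_probability(
    c=4.959,
    axon_length=3487.72846,
    dendrite_length=520.5298013,
    volume=9518233055,
    num_contacts=3.75,
)
print(f"{prob:.6f}")  # prints 0.000252
```

With these inputs the result agrees with the tool value (0.00025222886) and not with the spreadsheet value (0.000359759710525), which is consistent with the discrepancy lying in the inputs rather than in the formula.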

drdiek commented 4 years ago

@nmsutton I have an inquiry out to Carolina about her calculation for the number of potential synapses, which differs from the value I have calculated by hand. This value affects the final value for the probability, which matches between my calculation and the tool.

While we are waiting for Carolina to get back to us, can you try to generate the tool values for CA3, so we can check on those comparisons? Right now the tool does not generate as many values for CA3 as Carolina's spreadsheets.

nmsutton commented 4 years ago

@drdiek here is a file tool_ca3_combine_sorted.xlsx with more CA3 values. It has 77% (327/423) of the values and I am still working out a method to get the last 23%. For some reason my second iteration of collecting missing values didn't work right away so I will need to look into how to make sure to get those.

drdiek commented 4 years ago

@nmsutton You actually have all of the pairs of neurons that Carolina's data has, plus some extras. The tool is apparently outputting values for CA3 Axo-axonic, CA3 Basket, CA3 Basket CCK+, and CA3 Horizontal Axo-axonic when they are the first of the pair, and these 4 types should in fact be excluded from being the first in the pair.

drdiek commented 4 years ago

@nmsutton It is not looking good. A majority of the values are off in CA3 (Comparisons_update041820.xlsx). It is hard for me to imagine where to begin to piece this apart. I will start in on it later this evening after dinner.

drdiek commented 4 years ago

@nmsutton While I am fretting over CA3, can you work on generating the values for the rest of the subregions? I am morbidly curious to see how they match up.

drdiek commented 4 years ago

@nmsutton Update. I found that some of the values in CA3-Table-2.csv are wrong, which would throw off the values for the numbers of contacts, which would, in turn, affect the probability values. I am now awaiting an updated file from Carolina.

nmsutton commented 4 years ago

@drdiek thanks for looking into this, and I will work on producing the rest. I also updated the phpdev server just now with the latest CA1 data files and an updated db that was generated with the latest csv2db code. I previously updated it with the latest DG data files. This may help with any spot checking you are doing on the server if you are still having challenges getting a local site version working on your home system.

nmsutton commented 4 years ago

@drdiek here are CA1 tool_ca1.xlsx and CA2 tool_ca2.xlsx. It seems to me no connections are missing from them but please report back if you end up finding missing connections.

nmsutton commented 4 years ago

@drdiek I realize now there were some duplicates in the original CA1 and CA2 I just sent. Here are all the remaining subregions without duplicates: CA1 tool_ca1_2_distinct.xlsx CA2 tool_ca2_distinct.xlsx EC tool_ec_distinct.xlsx SUB tool_sub_distinct.xlsx. If you end up finding any missing connections please tell me and I will collect them.

drdiek commented 4 years ago

@nmsutton CA1 is missing 219 entries compared to Carolina's data, and EC is short 37 entries. CA2 and Sub totally check out.
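A count like this can be reproduced with a set difference over (presynaptic, postsynaptic) pairs. The sketch below is in Python rather than the portal's PHP; the function name and the sample pairs are illustrative assumptions, not rows from the actual spreadsheets.

```python
def compare_pairs(reference_pairs, tool_pairs):
    """Return (missing, extra): pairs only in the reference, pairs only in the tool."""
    reference = set(reference_pairs)  # set() also collapses duplicate rows
    tool = set(tool_pairs)
    missing = sorted(reference - tool)
    extra = sorted(tool - reference)
    return missing, extra


# Illustrative pairs only; real rows would come from the exported xlsx files.
reference_pairs = [
    ("CA1 Pyramidal", "CA1 Basket"),
    ("CA1 Pyramidal", "CA1 Axo-axonic"),
    ("CA1 Basket", "CA1 Pyramidal"),
]
tool_pairs = [
    ("CA1 Pyramidal", "CA1 Basket"),
    ("CA1 Pyramidal", "CA1 Basket"),      # duplicate export row
    ("CA1 Axo-axonic", "CA1 Pyramidal"),  # not present in the reference data
]

missing, extra = compare_pairs(reference_pairs, tool_pairs)
print(len(missing), "missing;", len(extra), "extra")  # prints: 2 missing; 1 extra
```

Reporting the extras alongside the missing pairs also surfaces cases where the tool emits types that should have been excluded.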

nmsutton commented 4 years ago

@drdiek thanks. The positive news is that I know exactly which entries those are. The negative news is that, for some unknown reason, my automated export code just doesn't want to export them. I will look into other ways to semi-automate getting them and, if needed, I will collect them manually.

drdiek commented 4 years ago

@nmsutton Carolina sent a new value for DG Neurogliaform to DG Total Molecular Layer, so we are now at 100% for DG. She also sent new files for CA3 and EC, which I have updated on the server. Please recompute the tool values for CA3 and EC at your convenience.

nmsutton commented 4 years ago

@drdiek I have discovered that the missing 219 entries and 37 entries (and others) are partly or entirely due to missing neuron types in neuron type-Table files. For example "CA1 Hippocampo-Subicular Projecting ENK+" is not in CA1-Table-1.csv and CA1-Table-2.csv yet is a part of the neuron types in the missing connections. Therefore, there is no way to get tool values without that. Shall I make a list of all missing neuron types or is there a way to just check in the file creation where a bug may be that causes the lack of entries?

drdiek commented 4 years ago

@nmsutton Carolina just got back to me. She has double checked and none of the CA1 types are missing. Some EC types may be missing due to a lack of reconstructions upon which to base her data mining. I just spot checked, and CA1 Hippocampo-Subicular Projecting ENK+ is in the Table files (lines 77-78). Therefore, your script must be missing some neuron types for some reason.

nmsutton commented 4 years ago

@drdiek ok thanks, I see that now, apologies for missing it earlier. One thing I found is that the names in the conndata.csv entries and in the *-Table files can differ slightly in the characters used. For example, the type is "CA1 Hippocampo-Subicular Projecting ENK+" in the conndata and "CA1 Hippocampo Subicular Projecting ENK+" in the Table files (one has a hyphen and one does not). That is probably causing a mismatch in the tool and making it not recognize the type. I will work on updating the names in the Table files and see how much it fixes the exporting of connections.
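Near-miss names of this kind can be listed automatically before any find/replace is done by hand. The following is a rough Python sketch, not part of the tool; the helper names are hypothetical, and it assumes that hyphens and runs of whitespace are the only differences worth ignoring.

```python
import re


def normalize_name(name):
    """Collapse hyphens and runs of whitespace to one space and lowercase the result."""
    return re.sub(r"[\s\-]+", " ", name).strip().lower()


def find_name_mismatches(conn_names, table_names):
    """Pair each conndata name with a Table name that differs only in spacing or hyphens."""
    table_by_key = {normalize_name(n): n for n in table_names}
    mismatches = []
    for name in conn_names:
        match = table_by_key.get(normalize_name(name))
        if match is not None and match != name:
            mismatches.append((name, match))
    return mismatches


conn_names = ["CA1 Hippocampo-Subicular Projecting ENK+", "CA3 Basket"]
table_names = ["CA1 Hippocampo Subicular Projecting ENK+", "CA3 Basket"]

for conn_name, table_name in find_name_mismatches(conn_names, table_names):
    print(f"conndata: {conn_name!r}  Table: {table_name!r}")
```

Names that match exactly are skipped, so the output is just the list of spellings that need to be reconciled.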

drdiek commented 4 years ago

@nmsutton If you worked with the Unique IDs instead of the names, this would not happen.

nmsutton commented 4 years ago

@drdiek Nikhol wrote the code using names, not IDs. It could be rewritten in the future, but I imagine the highest priority is just getting the datasets needed now. That said, your advice about using the IDs is good in general.

drdiek commented 4 years ago

@nmsutton Where do we stand currently? I have a meeting with Giorgio later this afternoon, and I would like to be able to give him an update.

nmsutton commented 4 years ago

@drdiek I was able to get the missing "CA1 Hippocampo-Subicular Projecting ENK+" connection to be successfully reported by the tool when the hyphen was added. This leads me to think that perhaps all the missing connections could be recovered by similar methods, but I will need to try this further. I just got out of a 3+ hour class, so I haven't had much time to work on this since then. I see your meeting is at 2:30pm. I had wanted to kill two birds with one stone by updating the missing connections and then exporting the updated CA3 and EC after that, but if it is taking too long I'll just export CA3 and EC without that. I'll work on getting this done by 2pm.

nmsutton commented 4 years ago

@drdiek EC: tool_ec_distinct_2.xlsx

nmsutton commented 4 years ago

@drdiek CA3: tool_ca3_distinct_2.xlsx

nmsutton commented 4 years ago

@drdiek these are exported values with the updated *table files. I did not have time to fix missing connections.

drdiek commented 4 years ago

@nmsutton One of the missing types is EC LII Basket-Multipolar Interneuron (-)230000, which appears to have extra spaces in the name between LII and Basket in Carolina's listing. Do we have to go in by hand and correct all of the types that are missing?

nmsutton commented 4 years ago

@drdiek well each type can have about 10 connections associated with it at times so with CA1 having 219 missing values that may mean 22 find/replace of the type names in *table files which I predict will be faster than recoding Nikhol's code. I have the intuition to try that first but if there are issues I will recode it. I am really just trying to do what is fastest right now.

drdiek commented 4 years ago

@nmsutton The newest EC values were calculated with the newest EC-Table files, correct? I ask because the NoC align well for the types that line up, but the probability values are way off. This makes me think that there is still a problem with the parcel volumes at the top of the EC-Table files.

nmsutton commented 4 years ago

@drdiek yes that is the newest EC tables

drdiek commented 4 years ago

@nmsutton OK, then I will have to hand check some values to see where the fault might lie. I will get back to you sometime after my meeting with Giorgio.

nmsutton commented 4 years ago

@drdiek ok thanks

drdiek commented 4 years ago

@nmsutton OK, I just realized a subtlety of calculating values for EC. The EC is comprised of two parts: LEC and MEC. Some neuron types are found in both parts, so their names begin with "EC." Otherwise, the types begin with "LEC" or "MEC." The parcel volumes used in the calculations for the probabilities need to reflect this division of EC. For example, MEC LV Pyramidal only uses the MEC portion of the parcel volumes listed at the beginning of the EC-Table files, whereas EC LI-II Multipolar-Pyramidal uses the "Sum" portion of the parcel volumes. What this boils down to is that the code for the tool will need to be modified so that all 3 rows of parcel volumes are read in, and then the correct row of values will need to be assigned to each neuron type. Have I explained this well enough?

nmsutton commented 4 years ago

@drdiek something different about the EC-Table files is that they are the only ones with:

  | LEC | 5341701801 | 7830471152 | 5618250778 | 3056571053 | 5606164096 | 4900219897
  | MEC | 2576770345 | 3628376490 | 2575731103 | 377361465.5 | 773049991.8 | 1282619658
  | Sum | 8915625392 | 12554182656 | 8912029617 | 1305670671 | 2674752972 | 4437864017

Yes, I see that at the top. I will try to update the tool. Does it look like the probability calculations would be right if that was adjusted for?
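The row selection described above could look roughly like the following Python sketch. It is not the tool's PHP code: the function names are hypothetical, the numbers are copied as-is from the three header rows quoted above, and columns are addressed by position because the parcel labels are not given in this thread.

```python
# Parcel volume rows as quoted from the top of the EC-Table files.
PARCEL_VOLUMES = {
    "LEC": [5341701801, 7830471152, 5618250778, 3056571053, 5606164096, 4900219897],
    "MEC": [2576770345, 3628376490, 2575731103, 377361465.5, 773049991.8, 1282619658],
    "Sum": [8915625392, 12554182656, 8912029617, 1305670671, 2674752972, 4437864017],
}


def volume_row_for(type_name):
    """Pick the volume row from the name prefix: LEC, MEC, or EC (which uses Sum)."""
    prefix = type_name.split()[0]
    if prefix in ("LEC", "MEC"):
        return prefix
    if prefix == "EC":
        return "Sum"
    raise ValueError(f"not an EC-region type name: {type_name!r}")


def parcel_volume(type_name, column):
    """Look up one parcel volume by zero-based column position."""
    return PARCEL_VOLUMES[volume_row_for(type_name)][column]


print(volume_row_for("MEC LV Pyramidal"))                # prints MEC
print(volume_row_for("EC LI-II Multipolar-Pyramidal"))   # prints Sum
print(parcel_volume("MEC LV Pyramidal", 0))              # prints 2576770345
```

Keeping the prefix rule in one small function means the probability code itself does not need any EC-specific branches.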

nmsutton commented 4 years ago

@drdiek Also, for my meeting with Giorgio: I know DG fully matches so far, but in a general sense, how close are the other regions to matching currently?

drdiek commented 4 years ago

@nmsutton Yes, at least for the MEC LV Pyramidal to MEC LV Pyramidal case, which is the first one listed. My hand calculation has determined that for the overlap in LI, the correct value to use is 2576770345 and not 8915625392.

I told Giorgio that we are two days away from aligning all of the subregions.

nmsutton commented 4 years ago

@drdiek thanks. If you don't mind, can you please share, as a general approximation, how close each of CA1, CA2, CA3, EC, and SUB currently is?

drdiek commented 4 years ago

@nmsutton DG, CA2, and Sub are completely matched. CA3 and CA1 are about 50%. EC is the farthest away because of the needed modifications to the php code.

drdiek commented 4 years ago

@nmsutton The latest version of the CA3 values you sent are missing entries for CA3 Lucidum-Radiatum (-)03300 as the first of the pair. It does appear in at least one instance as the second of a pair.

nmsutton commented 4 years ago

@drdiek thanks! I will look into these things, good to be in sync about what to focus on - CA1, CA3, and EC.

drdiek commented 4 years ago

@nmsutton CA3 is also missing entries that begin with CA3 Interneuron Specific Oriens (-)01113.

drdiek commented 4 years ago

@nmsutton Once you get me those missing CA3 entries, and I double check the values against Carolina's, you will need to perform a new database ingest to get the latest updated matrix values. At this point all of the CA3 entries that I have access to now match up.

nmsutton commented 4 years ago

@drdiek some good news: I may have fixed the EC issue. Nikhil did have code to select which volume to use; it just didn't work right. I adjusted some if statements and now it appears to be providing good results. Attached are the updated values; please tell me how they look: tool_ec_distinct_3.xlsx

nmsutton commented 4 years ago

@drdiek in addition, you asked about "CA3 Lucidum-Radiatum (-)03300" but "CA3 Lucidum-Radiatum" is on the exclude list. You also asked about "CA3 Interneuron Specific Oriens (-)01113", "CA3 Interneuron Specific Oriens" is on the exclude list. You also asked about "EC LII Basket-Multipolar Interneuron (-)230000", "EC LII Basket-Multipolar" is on the exclude list and that is the closest name description to a similar neuron type in the EC-Table files, so I think that includes the interneuron unless I am mistaken.

Thanks in any case for looking into this. Should any of these not be on the exclude list? You wrote once those missing values are retrieved I should do a new database ingest. Should I do a new database ingest now, or should I keep trying to get other missing connections before that?

drdiek commented 4 years ago

@nmsutton Carolina definitely has values for "CA3 Lucidum-Radiatum (-)03300" and "CA3 Interneuron Specific Oriens (-)01113," so I would remove them from the exclude list. I would also remove "EC LII Basket-Multipolar Interneuron (-)230000" and "LEC LIII Complex Pyramidal (+)233310."

This just leaves a handful of conflicting values for EC, which I have to trace by hand.

drdiek commented 4 years ago

@nmsutton I believe Carolina mis-added the parcel sums for EC, so all of the pairings with an "EC" name are off. Please recalculate the EC tool values with the updated EC-Table files.

drdiek commented 4 years ago

@nmsutton I found a possible major mistake in Carolina's EC values. I am asking her to double check my calculations.

nmsutton commented 4 years ago

@drdiek I read that you asked for some neuron types to be taken off the exclude list. For the CA3 neuron types you mentioned, you also noted that they were in the presynaptic part of the connections. As a friendly reminder, the exclude list works by removing neuron types only from the pre part, not the post part. If the types are in Carolina's data, it could be because they are included as post connections but not pre. For the EC types you mentioned, I did a little checking but could not tell whether the missing ones were in the pre or post position. Were the specific EC ones you observed missing in pre or post?

As a suggestion, with all due respect, would it make sense to check with Carolina whether those should be removed from the exclude list? Their presence in her data could be explained by them being possible post connections, and she may have intentionally excluded them from pre. I don't recall the specific reason she gave for why the exclude list was created. If you tell me to just take them off the list I will do that; I just wanted to make sure we are on the same page. Thanks in advance.
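To make the pre-only behavior of the exclude list concrete, here is a small Python sketch. It is an illustration of the rule as described above, not the tool's code; the function name and the "CA3 Pyramidal" partner type are assumptions used only for the example.

```python
def apply_exclude_list(connections, excluded_pre_types):
    """Drop pairs whose presynaptic type is excluded; postsynaptic types are not checked."""
    return [
        (pre, post)
        for pre, post in connections
        if pre not in excluded_pre_types
    ]


excluded_pre_types = {"CA3 Lucidum-Radiatum", "CA3 Interneuron Specific Oriens"}

connections = [
    ("CA3 Lucidum-Radiatum", "CA3 Pyramidal"),   # excluded: listed type is pre
    ("CA3 Pyramidal", "CA3 Lucidum-Radiatum"),   # kept: listed type is only post
    ("CA3 Pyramidal", "CA3 Pyramidal"),          # kept: no listed type involved
]

kept = apply_exclude_list(connections, excluded_pre_types)
print(kept)
```

Under this rule, a type on the exclude list can still legitimately appear in the output, but only in the second position of a pair.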

drdiek commented 4 years ago

@nmsutton I have an email inquiry to Carolina about the two CA3 types. "EC LII Basket-Multipolar" is supposed to be on the exclude list, but it is also being excluded as a post neuron type. "LEC LIII Complex Pyramidal (+)233310" is not on the exclude list, but it is being excluded both pre and post, and it should not be excluded at all, because there are definitely values.

nmsutton commented 4 years ago

@drdiek looks like another hyphen issue with "LEC LIII Complex Pyramidal", I see it listed as "LEC LIII Complex-Pyramidal" in the Table files. I will correct that and look for similar issues.

nmsutton commented 4 years ago

@drdiek the hyphen fixed "LEC LIII Complex Pyramidal", but I found an issue where "EC LII Basket-Multipolar" is listed 2 different ways in conndata.csv. In the post column it is listed as "EC LII Basket-Multipolar Interneuron" with the same id. I would imagine this is causing problems with the tool. This is getting complex enough that it seems it may be better at this point to take your advice and rewrite the tool to use ids rather than names. Manually finding all these issues seems time consuming. I will try to do the coding and report back how it goes.
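An ID-keyed lookup sidesteps this kind of name drift entirely. The Python sketch below shows the general idea only and is not the rewritten tool; the row layouts and function names are assumptions, and the ID 6001 is a made-up placeholder (1007 and 1004 are the IDs quoted earlier in this thread).

```python
def build_type_index(table_rows):
    """Map unique ID -> canonical type name from (id, name) rows of the Table files."""
    return {type_id: name for type_id, name in table_rows}


def resolve_connections(conn_rows, type_index):
    """Resolve each (pre_id, pre_name, post_id, post_name) row by ID, ignoring the names."""
    resolved, unknown = [], []
    for pre_id, _pre_name, post_id, _post_name in conn_rows:
        if pre_id in type_index and post_id in type_index:
            resolved.append((type_index[pre_id], type_index[post_id]))
        else:
            unknown.append((pre_id, post_id))
    return resolved, unknown


table_rows = [
    (1007, "DG Neurogliaform"),
    (1004, "DG Total Molecular Layer"),
    (6001, "EC LII Basket-Multipolar"),  # 6001 is a placeholder ID for illustration
]
conn_rows = [
    (1007, "DG Neurogliaform", 1004, "DG Total Molecular Layer"),
    # Same ID, two spellings of the name in the pre and post columns:
    (6001, "EC LII Basket-Multipolar", 6001, "EC LII Basket-Multipolar Interneuron"),
]

resolved, unknown = resolve_connections(conn_rows, build_type_index(table_rows))
print(resolved)
print(unknown)
```

Rows whose IDs are not in the index are returned separately, so genuinely missing types can be told apart from spelling differences.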

nmsutton commented 4 years ago

@drdiek It appears I have found a way to use the IDs instead of names. I ran the new code on the EC data. Please tell me if any entries are still missing in the new file: tool_ec_distinct_4.xlsx. 37 missing entries were added, which is the same number you counted as missing earlier.