Nuix / Top-Level-Dupe-Info-Propagation

Annotates top level items and their descendants with information about duplicate custodians and duplicate paths.
Apache License 2.0

memory issue #2

Closed SLICameronl closed 3 years ago

SLICameronl commented 3 years ago

This script worked earlier for me on a smaller data set, but I'm trying to run it on about 3 million documents and I just got an out-of-memory error. I currently have 60 GB assigned to the application. Am I able to run the script in chunks? I'm guessing you need to have everything selected in order for the dedupe to be correct.

JuicyDragon commented 3 years ago

Hello @SLICameronl I figure there are a few places that could be contributing to this.

The script attempts to cache some results in some situations, trading memory for speed. It could be that the cached calculations are filling up memory.

The script also applies annotations in 5,000-item batches rather than all at the end or once per item, which can yield memory benefits as well.
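A minimal plain-Ruby sketch of that batching pattern (the item hashes and the `annotate_batch` helper here are hypothetical stand-ins for the Nuix API calls; the real script operates on Nuix item objects):

```ruby
# Hypothetical stand-in for applying an annotation to a batch of items.
def annotate_batch(batch)
  batch.each { |item| item[:annotated] = true }
end

# Fake "items" standing in for Nuix items.
items = Array.new(12_000) { |i| { id: i, annotated: false } }

batch_size = 5_000
batches_applied = 0

# Apply annotations in fixed-size batches rather than all at the end or
# once per item, bounding peak memory while limiting per-call overhead.
items.each_slice(batch_size) do |batch|
  annotate_batch(batch)
  batches_applied += 1
end

puts batches_applied                     # 3 batches for 12,000 items
puts items.all? { |i| i[:annotated] }    # true
```

With 12,000 items and a batch size of 5,000, `each_slice` yields two full batches and one partial batch of 2,000.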

Both of these are likely culprits. It could also be the cost of the calculations adding up. As I understand it, Nuix item collections sort of "hydrate" the items into memory from the stores as you interact with them.

What settings were you using when you had the issue? Knowing that can help me see what logic it was using and see what I can do to try and balance things a bit more.

SLICameronl commented 3 years ago

(screenshot of the script's settings dialog)

Hi Jason, those are my settings. Ultimately I'm trying to create a top-level duplicate custodian field.

The entire data set is a little over 3 million documents, and I am using a command line switch to allocate 60 GB to the Nuix application.

My temp drive is also an SSD. Not sure how big of a difference that's making, but just giving you an overview of our environment.

I was able to get a scripted metadata field to populate the top-level-dupe custodians, but I was unable to get the actual custodian added to that same field. This script accomplished that, so it was just a little cleaner for me on export.

Thanks!

SLICameronl commented 3 years ago

Also, the temp drive is 1.5 TB.

JuicyDragon commented 3 years ago

After testing some things out I believe the issue is with this chunk of code:

if !$current_selected_items.nil? && $current_selected_items.size > 0
    pd.logMessage "Using selected top level items..."
    items = $current_selected_items.select{|i|i.isTopLevel}

    if pull_in_selection_duplicates
        pd.logMessage("Pulling in duplicates of selected top level items...")
        items = iutil.findItemsAndDuplicates(items)
        items = items.select{|i|i.isTopLevel}
    end
else
    pd.logMessage "Using all top level items..."
    items =  $current_case.search("flag:top_level")
end

Which I have updated to instead be this:

if !$current_selected_items.nil? && $current_selected_items.size > 0
    all_top_level_items = $current_case.searchUnsorted("flag:top_level")
    pd.logMessage "Using selected top level items..."
    items = iutil.intersection($current_selected_items,all_top_level_items)

    if pull_in_selection_duplicates
        pd.logMessage("Pulling in duplicates of selected top level items...")
        items = iutil.findItemsAndDuplicates(items)
        items = iutil.intersection(items,all_top_level_items)
    end
else
    pd.logMessage "Using all top level items..."
    items =  $current_case.search("flag:top_level")
end

The key difference being the removal of statements like this:

items.select{|i|i.isTopLevel}

The item collections you get back from Nuix implement interfaces such as List, Collection, and Set, but the underlying implementation has some mechanisms to page items in (I believe). That statement basically seems to cause the entirety of the collection to be "realized" up front in memory, which defeats some of the mechanisms that keep the larger collections efficient.
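Nuix's internal paging mechanism isn't public, so this is only a rough plain-Ruby model of the effect: an enumerator simulates loading fixed-size pages from a backing store, and an eager `select` (like `items.select{|i|i.isTopLevel}`) forces every page to load, while lazy consumption touches only the pages actually needed:

```ruby
pages_fetched = 0

# Model a Nuix-like collection that pages items in from a backing store
# on demand; each slice counts as one simulated store read.
paged = Enumerator.new do |y|
  (1..6).each_slice(2) do |page|
    pages_fetched += 1
    page.each { |id| y << id }
  end
end

# Lazy consumption stops as soon as it has enough values, so only the
# first page is ever read.
paged.lazy.first(2)
lazy_pages = pages_fetched

# An eager select walks the whole collection, forcing every page to
# load, which mirrors the "realized up front" behavior described above.
pages_fetched = 0
paged.select(&:odd?)
eager_pages = pages_fetched

puts [lazy_pages, eager_pages].inspect   # [1, 3]
```

The `iutil.intersection` approach in the updated script sidesteps this by letting Nuix compute the overlap between two of its own collections rather than filtering items one by one in Ruby.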

In a 500k-item case, running against 70k top level items with about 30% duplicates, this change reduced my memory usage from 10,480 MB to 8,020 MB. Hopefully you should see an even better reduction with your larger case numbers. The alternative intersection approach can have drawbacks of its own though, so please give it a test when you can and let me know how it goes 😄

Updated release of the script can be downloaded here: https://github.com/Nuix/Top-Level-Dupe-Info-Propagation/releases/tag/v1.16.0

SLICameronl commented 3 years ago

That worked perfectly. I checked a few documents and the script looks to be working. FWIW, I ran it on 3.3 million docs and it took 4 hrs 20 mins.

Thank you @JuicyDragon

JuicyDragon commented 3 years ago

Glad that did the job. I don't have a sense of how long one might normally expect it to run on that volume; my guess is it also depends on factors like how duplicative the data in the case is. I am going to close this for now, but feel free to re-open it or create a new issue if you run into this again.

SLICameronl commented 3 years ago

Hi @JuicyDragon

Quick question for you. The very first setting, "Also pull in duplicates of selected top level items": is that only in regard to applying the duplicate custodian values to those documents as well as the originals? I do not need to populate the values for the duplicate docs, so I'm wondering, if I uncheck this option, would that cut down on the number of documents that Nuix needs to go through and apply the value to?

Just making sure the deduplication would still work correctly even if that setting is unchecked.

JuicyDragon commented 3 years ago

Hello @SLICameronl

The script iterates a set of input items. For each of those input items it calculates the duplicate values from the items in the case (or uses a cached value) and then applies the result to that item and its descendants. As you noted that setting is basically there so that the duplicates of the top level items you have selected also get those annotations, even if they were not originally in your selection. If you only want the values recorded on the items you selected then you should be good to leave this unchecked.

Where the script will begin limiting the set of items that are used to calculate the duplicate values is when you check the setting Duplicates Must Also be in Selection. When Duplicates Must Also be in Selection is checked, the logic that obtains the duplicates of a given item for calculating the values will include an additional step that will filter out duplicates that were not in your original selection of items.

So for example, imagine I have custodians A, B, and C who all have a copy of the same top level item, and I select only the copies belonging to custodians A and C. When Duplicates Must Also be in Selection is checked, only custodians A and C should be reported. When it is unchecked, all three custodians' items would be reported, even though only two had their item in the initial input.
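That A/B/C scenario can be modeled in a few lines of plain Ruby (the guids, custodian names, and the `duplicate_custodians` helper are all hypothetical; the real logic lives inside the script's duplicate-value calculation):

```ruby
# Hypothetical model: three copies of the same top-level item, one per custodian.
duplicates = [
  { guid: "g-a", custodian: "A" },
  { guid: "g-b", custodian: "B" },
  { guid: "g-c", custodian: "C" },
]

# Only A's and C's copies were in the original selection.
selection = ["g-a", "g-c"]

def duplicate_custodians(duplicates, selection, must_be_in_selection:)
  considered = duplicates
  # The extra step the setting adds: drop duplicates whose item was not
  # part of the original selection before computing the values.
  considered = considered.select { |d| selection.include?(d[:guid]) } if must_be_in_selection
  considered.map { |d| d[:custodian] }.sort
end

# Checked: only A and C are reported.
puts duplicate_custodians(duplicates, selection, must_be_in_selection: true).inspect

# Unchecked: all three custodians are reported.
puts duplicate_custodians(duplicates, selection, must_be_in_selection: false).inspect
```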

Hopefully that clears things up, but if not please feel free to follow up with additional questions 😄

SLICameronl commented 3 years ago

@JuicyDragon

That is a perfect explanation and is exactly what I was looking for.

Thanks!