Performance problems with large (~100K obs) datasets with many series - Githubissues

AAVSO / VStar

VStar is a visualisation and analysis tool for variable star data brought to you by AAVSO

https://www.aavso.org/vstar

GNU Affero General Public License v3.0

10 stars 3 forks source link

Performance problems with large (~100K obs) datasets with many series #291

Closed dbenn closed 1 year ago

dbenn commented 2 years ago

e.g. load last two years of SS Cyg data with all bands selected.

Raised in recent CHOICE course: https://www.aavso.org/comment/162798#comment-162798

Leo said there:

Like Ron, I have some questions about performance. Are the times I am experiencing normal, or am I doing something wrong?
Initial load of SS Cyg (all bands): 55 secs
Deselect Johnson V on the Plot Control dialog: 5 secs
Select all bands on the Plot Control dialog: 3 mins 4 secs
Deselect all bands on the Plot Control: 6 mins
Select discrepant: 15 secs

dbenn commented 2 years ago

Profiling with VisualVM is revealing that insofar as the Select All bands problem is concerned, the include() method in the VisibleSeriesRowFilter class is the problem, from around line 102 or 103 onward. The SeriesTyperelated for loop was a relatively recent addition and it is the equals() method in SeriesType and ValidObservation that are taking all the time.

Screen Shot 2022-06-17 at 22 34 21

dbenn commented 2 years ago

Two approaches initially suggest themselves:

disable/remove VisualSeriesRowFilter until after series selection operation completes.
- This would not help the plot control dialog case mentioned in the AAVSO forum link above, just the load.
optimise the include() method.
- This would solve any scenario in which the filter is invoked, whether upon load or plot control series selection. We probably need to think in terms of an O(n^2) -> O(nlogn) or better complexity reduction.

dbenn commented 1 year ago

Okay, so after further analysis, it's lines 104 to 115 that are the culprit:

    Map<SeriesType, List<ValidObservation>> obsMap = Mediator.getInstance()
            .getValidObservationCategoryMap();
    if (obsMap != null) {
        // Or, it could be in another visible series, e.g. a user defined series.
        for (SeriesType visibleType : visibleSeries) {
            if (obsMap.containsKey(visibleType)) {
                visible = obsMap.get(visibleType).contains(ob);
                if (visible)
                    break;
            }
        }
    }

In the presence of many series (which we have when many bands are loaded), this approaches O(N^2), well okay: O(NM), since the line:

visible = obsMap.get(visibleType).contains(ob);

is O(N).

dbenn commented 1 year ago

Looking at the CHOICE course comments (from Leo and Ron), the time blowout is primarily with select/deselect all in the plot control dialog vs loading.

The O(N) line mentioned in the comment above must only be called to check whether an observation exists in some other series not yet checked. The only such series would be user defined series. Therefore, adding a guard to check for whether a visible series type is user defined would make sense.

for (SeriesType visType : visibleSeries) {
    if (visType.isUserDefined() && obsMap.containsKey(visType)) {
        ...

dbenn commented 1 year ago

Another thing that would help is if all occurrences of List<ValidObservation> were LinkedHashSet<ValidObservation> since insertion order is preserved but lookup is O(1) instead of O(N). This is where C/C++ typedefs would be handy in Java.

dbenn commented 1 year ago

Having a set of series (or better yet, series other than that given by getBand()) to which an observation belongs in ValidObservation would allow an O(1) test of whether an observation belonged to another series, such as filtered or user defined. The cost would be a very small amount of memory for Set with one or two members per observation per session. For many observations, that could still be a lot of memory, e.g. 10,000 observations x 4 bytes = ~40K or ~400K for 100,000 observations. I think the cost is worth it though, and is not too exhorbitant. Plus, are 100,000 obs loads common?

dbenn commented 1 year ago

Another thing to note is that I see that even with the current release (2.22.0), the current (and slow) technique for checking whether an observation belongs to another series (user defined) does not have the desired effect on the observation list pane anyway, i.e. the observation does not show up in the list!

dbenn commented 1 year ago

@mpyat2 I have committed a fix for this issue (see last two commits above) but annoyingly, I have accidentally committed all this to the master branch! Apologies.

Would you mind looking at what I've committed and let me know if you have questions?

The key changes are to VisibleSeriesRowFilter and ValidObservationTableModel to properly handle user defined observations. In support of this, there is also a new field in ValidObservation to allow an observation's band to be distinguished from series, in the case of copied, user-defined series, e.g. a user-defined series could have Johnson V observations but their series could be called "My Great Series". Having this permits O(1) checks of membership of an observation in such a series.

I have left the Filtered series check in VisibleSeriesRowFilter as it was though, because whereas there will only be one series called "My Great Series", there can be many called Filtered in any given VStar session, and the current implementation satisfies that fact.

dbenn commented 1 year ago

@mpyat2, this seems to be working as expected; have you seen any problems? Unfortunately because I accidentally committed to master, the only way to see the changes is to look at the 2 commits (see above).

mpyat2 commented 1 year ago

Hi @dbenn , sorry for the late answer. I have seen problems so far, I think the issue can be marked as solved.

dbenn commented 1 year ago

Marking as fixed