VEuPathDB / service-eda

Repo containing EDA web service
Apache License 2.0

Ref metadata n-squared perf fix #38

Closed dmgaldi closed 3 months ago

dmgaldi commented 3 months ago

Overview

Updated code to do a hash lookup instead of a full scan to check for a variable's existence.

I'm not 100% sure whether we need to preserve the order in which variables are added to the entity, so I opted for LinkedHashMap to be safe.
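For anyone skimming, here is a minimal sketch of the idea, not the actual EntityDef change. `VariableRegistry`, its methods, and the choice to throw on a duplicate ID are all hypothetical stand-ins for illustration. It shows a LinkedHashMap giving a hash-based existence check while still iterating in insertion order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an entity's variable collection; not the real EntityDef.
public class VariableRegistry {
  // LinkedHashMap: expected constant-time key lookup, iteration in insertion order.
  private final Map<String, String> variablesById = new LinkedHashMap<>();

  // Assumption for this sketch: duplicate IDs are rejected. The existence check is
  // a hash lookup instead of a scan over every variable added so far.
  public void addVariable(String variableId, String displayName) {
    if (variablesById.containsKey(variableId)) {
      throw new IllegalArgumentException("Duplicate variable ID: " + variableId);
    }
    variablesById.put(variableId, displayName);
  }

  // Hash lookup for a variable's existence.
  public boolean hasVariable(String variableId) {
    return variablesById.containsKey(variableId);
  }

  // Returns the IDs in the order they were added.
  public List<String> variableIds() {
    return new ArrayList<>(variablesById.keySet());
  }
}
```

If insertion order turns out not to matter, a plain HashMap would give the same lookup behavior with slightly less overhead.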

Ran this test, but didn't commit it:

  @Test
  public void testPerformance() {
    EntityDef def = new EntityDef("id", "perf-test", "perf-test-colname", false);
    Instant start = Instant.now();
    for (int i = 0; i < 150_000; i++) {
      def.addVariable(new VariableDef(
          "id",
          "variableId" + i,
          APIVariableType.STRING,
          APIVariableDataShape.CATEGORICAL,
          false,
          false,
          Optional.empty(),
          Optional.empty(),
          "id",
          List.of("a", "b"),
          false,
          null,
          VariableSource.NATIVE
      ));
    }
    System.out.println(Duration.between(start, Instant.now()));
  }

dmgaldi commented 3 months ago

> Approved but surprised this would make that much of a difference. This is just a quick var lookup at the validation/configuration of data/compute plugins. Shouldn't be run over and over....

That unit test took 3 minutes to run with 150,000 variables before the change and takes milliseconds after. I think it's because we were scanning the full list each time a variable was added, so the total work grows with the square of the number of variables. It's not being run over and over; it's just that a high cardinality of variables seems to make it fall over.
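A rough back-of-the-envelope for the n-squared part, assuming each add scans every variable already in the list without finding a match (which holds in the test above, since every ID is unique). `ScanCost` is a hypothetical helper written only for this estimate.

```java
// Hypothetical helper for estimating the cost of the old list-scan approach.
public class ScanCost {
  // The i-th insert (0-based) scans the i elements already present,
  // so n inserts examine 0 + 1 + ... + (n - 1) = n * (n - 1) / 2 elements.
  public static long listScanComparisons(long n) {
    return n * (n - 1) / 2;
  }

  public static void main(String[] args) {
    // prints 11249925000
    System.out.println(listScanComparisons(150_000));
  }
}
```

So 150,000 adds come to roughly 11 billion ID comparisons with the list scan, versus about 150,000 hash lookups with the map.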