Move Judge Search to elasticsearch

albertisfu commented 1 year ago

As per our conversation today, we need to move Judge Search to Elasticsearch.

albertisfu commented 1 year ago

@mlissner Related to this one, we talked about applying nested search to make this search better.

So the API results should look something like this?

{
   "absolute_url":"",
   "other-flat-fields-on-person-model":"",
   "positions":[
      {
         "position_type":"Judge",
         "job_title":"",
         "court":"",
         "school":"",
         "appointer":"",
         "other-position-fields":""
      }
   ],
   "education":[
      {
         "school":"",
         "degree_level":"",
         "degree_detail":"",
         "degree_year":""
      }
   ],
   "political_affiliations":[
      {
         "political_party":"",
         "source":"",
         "other-political-fields":""
      }
   ],
   "aba_ratings":[
      {
         "year_rated":"",
         "rating":""
      }
   ]
}

As for the front end, it currently looks like this: [Screenshot 2023-06-09 at 11 37 13]

With the nested approach, should it change to something like this?

Judge Name (Court)
Born:"year and city"
ABA Ratings:
     Rating:"", Year rated:""
...
Positions:
     Position type: "", Job Title: "", Court: "", Selection Method: "", Appointer:"" ... (Which other fields are relevant to show in front end?)
...
Education:
     School: "", Degree Level:"", Degree year: "" 
...
Political Affiliations:
    Political Party: "", Source:"", Date start: ""
...

How many nested objects (e.g., Positions) of each type should we show at once: all of the available ones, or only the nested objects that matched the search? Will the approach to displaying nested objects be the same for the front end and the API?

mlissner commented 1 year ago

Yeah, that seems about right. I think we should show all the matched sub-objects when we show search results, right?

mlissner commented 1 year ago

I'm not sure what's best for the API. I guess it should return everything we know about a valid hit?

albertisfu commented 1 year ago

Yeah, for the API, I'm not sure either. If we return everything we know about a hit, wouldn't that be duplicating the people API endpoint?

So, perhaps we should only display the sub-objects that match the search as well?

mlissner commented 1 year ago

Makes sense to me. We want something like a snippet somehow...

mlissner commented 1 year ago

I guess the best approach is to think about what we need and to build that.

albertisfu commented 1 year ago

@mlissner Here are my findings after analyzing the best approach to handle nested Judge objects for search.

The possible approaches to handle these objects in Elasticsearch are:

### Nested field

This field allows the indexing of nested objects under a parent document. These objects can be queried independently since, behind the scenes, these documents are indexed separately.

For instance:

"_source": {
    "id": 2,
    "fjc_id": null,
    "gender": "Female",
    "name": "John Deer Parks II",
    "dob_city": "Rebeccafurt",
    "dob_state": "Louisiana",
    "dob_state_id": "LA",
    "absolute_url": "/person/2/john-deer-parks-ii/",
    "positions": [
        {
            "position_type": "jud",
            "job_title": "",
            "appointer": "Clinton"
        },
        {
            "position_type": "c-jud",
            "job_title": "",
            "appointer": "Obama"
        }
    ]
}

We'll be able to index related Judge documents in a structured way. As a result, we'll be able to retrieve them with the same structure, potentially solving the issue we currently face with the API where related objects do not reflect the database structure.

The advantage of a nested field is that each nested document is indexed independently. These nested objects can also be queried independently. This is useful if we want to filter objects as follows:

Consider the following documents.

1.-

{
    "name": "John Deer Parks II",
    ...
    "positions": [
        {
            "position_type": "jud",
            "job_title": "",
            "appointer": "Clinton"
        },
        {
            "position_type": "c-jud",
            "job_title": "",
            "appointer": "Clinton"
        }
    ]
}

2.-

{
    "name": "Judith Sheindlin",
    ...
    "positions": [
        {
            "position_type": "jud",
            "job_title": "",
            "appointer": "Clinton"
        },
        {
            "position_type": "c-jud",
            "job_title": "",
            "appointer": "Obama"
        }
    ]
}

If we perform a search with positions.position_type=jud, it will return documents 1 and 2. However, since positions are indexed independently, we can also apply several conditions to the same position. For instance, a search with positions.position_type=c-jud AND positions.appointer=Clinton will return only document 1, because it is the only one with a single position that satisfies both conditions.

The disadvantage of using this field is that each nested object is indexed as a separate document behind the scenes, which makes nested queries more expensive than simple queries.
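To make the per-object matching concrete, here is a minimal sketch in Python. The query body uses Elasticsearch's `nested` query with the field names from the examples above, and it assumes `positions` is mapped as `nested` with keyword sub-fields. `nested_match` is a hypothetical helper that only imitates the matching semantics locally; it is not how CourtListener implements the search.

```python
# Documents from the example above (only the relevant fields).
DOCS = [
    {"name": "John Deer Parks II", "positions": [
        {"position_type": "jud", "job_title": "", "appointer": "Clinton"},
        {"position_type": "c-jud", "job_title": "", "appointer": "Clinton"},
    ]},
    {"name": "Judith Sheindlin", "positions": [
        {"position_type": "jud", "job_title": "", "appointer": "Clinton"},
        {"position_type": "c-jud", "job_title": "", "appointer": "Obama"},
    ]},
]

# Query body for a field mapped as {"type": "nested"}: both term clauses
# must be satisfied by the same position object.
NESTED_QUERY = {
    "query": {
        "nested": {
            "path": "positions",
            "query": {
                "bool": {
                    "filter": [
                        {"term": {"positions.position_type": "c-jud"}},
                        {"term": {"positions.appointer": "Clinton"}},
                    ]
                }
            },
        }
    }
}


def nested_match(doc, path, criteria):
    """Hypothetical helper: True if one object under `path` meets all criteria."""
    return any(
        all(obj.get(field) == value for field, value in criteria.items())
        for obj in doc.get(path, [])
    )


criteria = {"position_type": "c-jud", "appointer": "Clinton"}
matches = [d["name"] for d in DOCS if nested_match(d, "positions", criteria)]
print(matches)  # ['John Deer Parks II']
```

Only the first document has a single position that is both `c-jud` and appointed by Clinton, so it is the only match.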

### Object field

Using this field, we can still preserve the nested structure when retrieving documents. However, the difference with the Nested field is that an Object field will internally flatten the content.

The advantage of this is that these objects can be searched with simple queries, which are faster than nested queries.

The disadvantage is that since objects are internally flattened, we can't perform conditional queries within the inner documents. For example, considering the previous examples, they will be internally flattened as follows:

1.-

{
"name": "John Deer Parks II",
...
"positions.position_type": ["jud", "c-jud"]
"positions.job_title": []
"positions.appointer": ["Clinton"]
}

2.-

{
"name": "Judith Sheindlin",
...
"positions.position_type": ["jud", "c-jud"]
"positions.job_title": []
"positions.appointer": ["Clinton", "Obama" ]
}

So if we search by positions.position_type=jud, it'll return documents 1 and 2. But if we search by positions.position_type=c-jud AND positions.appointer=Clinton, it'll still return documents 1 and 2: because the fields are flattened, we can't apply the conditions independently within each nested object.

The filters we currently have for Judge search are:
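The flattening behavior can be imitated in a few lines of Python. This is only a sketch of the semantics described above: `flatten` and `object_match` are hypothetical helpers, not Elasticsearch APIs, and empty values are skipped to mirror the flattened listing.

```python
DOCS = [
    {"name": "John Deer Parks II", "positions": [
        {"position_type": "jud", "job_title": "", "appointer": "Clinton"},
        {"position_type": "c-jud", "job_title": "", "appointer": "Clinton"},
    ]},
    {"name": "Judith Sheindlin", "positions": [
        {"position_type": "jud", "job_title": "", "appointer": "Clinton"},
        {"position_type": "c-jud", "job_title": "", "appointer": "Obama"},
    ]},
]


def flatten(doc, path):
    """Collapse the objects under `path` into one set of values per field."""
    flat = {}
    for obj in doc.get(path, []):
        for field, value in obj.items():
            values = flat.setdefault(f"{path}.{field}", set())
            if value != "":
                values.add(value)
    return flat


def object_match(doc, path, criteria):
    """True if every criterion matches some value, from any object."""
    flat = flatten(doc, path)
    return all(value in flat.get(field, set()) for field, value in criteria.items())


criteria = {"positions.position_type": "c-jud", "positions.appointer": "Clinton"}
matches = [d["name"] for d in DOCS if object_match(d, "positions", criteria)]
print(matches)  # ['John Deer Parks II', 'Judith Sheindlin']
```

Judith Sheindlin matches even though her `c-jud` position was appointed by Obama, because the link between a position type and its appointer is lost once the values are flattened.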

Name
Born after, Born before
Birth City
School Attended
Appointed By
Selection Method
Political Affiliation

These filters currently operate over flattened values. As such, they return results that match the provided filters across all possible nested objects, without considering whether they belong to the same nested object.

If we don't need to change this behavior, the best option might be to use the Object fields. This approach will deliver the best performance while retaining the advantage of retrieving results where the nested objects reflect the structure we have in the database, which will improve the API results.

With both the Nested field and the Object field, adding or updating one nested object requires reindexing the whole parent document, including all of its nested objects. In this use case that might not be a problem, since Judge-related objects are not as numerous as the child objects in RECAP.

### Join field type

The third method for handling nested documents is the Join field type, which lets us query nested documents with has_child or has_parent queries. These queries can match on fields in the child documents and also on fields in the parent document.

The disadvantage of this field is that queries are more costly than simple queries since a join query needs to be executed.

The advantage is that child documents are indexed independently from the parent document, which is beneficial for indexing, especially if the parent has a large number of child objects. As such, we can add, update, or delete a child document without having to reindex the parent or its other child documents.
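As a rough illustration of how this could look for Judges, the sketch below builds a mapping with a `join` field, a parent and a child document, and a `has_child` query. The field name `person_child`, the relation names `person` and `position`, the ID scheme, and the `id`/`routing`/`body` dictionary layout are all hypothetical. What is not an assumption is that a child document must be routed to the same shard as its parent, which is why the child carries the parent ID as its routing value.

```python
import json

JOIN_FIELD = "person_child"  # hypothetical field name

# Index body declaring a single parent -> child relation.
INDEX_BODY = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "position_type": {"type": "keyword"},
            "appointer": {"type": "keyword"},
            JOIN_FIELD: {"type": "join", "relations": {"person": "position"}},
        }
    }
}


def person_action(person_id, name):
    """Parent document: the join field only names the relation."""
    return {
        "id": f"person_{person_id}",
        "body": {"name": name, JOIN_FIELD: {"name": "person"}},
    }


def position_action(position_id, person_id, position_type, appointer):
    """Child document: points at its parent and is routed to the parent's shard."""
    parent_id = f"person_{person_id}"
    return {
        "id": f"position_{position_id}",
        "routing": parent_id,
        "body": {
            "position_type": position_type,
            "appointer": appointer,
            JOIN_FIELD: {"name": "position", "parent": parent_id},
        },
    }


# Return people with at least one position matching both filters;
# inner_hits also returns the child documents that matched.
HAS_CHILD_QUERY = {
    "query": {
        "has_child": {
            "type": "position",
            "query": {
                "bool": {
                    "filter": [
                        {"term": {"position_type": "c-jud"}},
                        {"term": {"appointer": "Clinton"}},
                    ]
                }
            },
            "inner_hits": {},
        }
    }
}

parent = person_action(2, "John Deer Parks II")
child = position_action(7, 2, "c-jud", "Clinton")
print(json.dumps(child, indent=2))
```

Because each position is its own document, adding a new position only means indexing one more child with the right `parent` and routing values. The `inner_hits` option is one way to return only the sub-objects that matched, as discussed earlier in the thread.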

While I don't think we need this type of relation for Judges, I believe it might be useful for RECAP. It can solve the problem of being unable to search Dockets without documents. Considering that dockets can contain hundreds of documents, this may be the appropriate approach since we wouldn't need to reindex all the documents for a docket each time a new one is added or updated.

Let me know if you have some thoughts about these approaches.

mlissner commented 1 year ago

Thanks Alberto. My thoughts:

- As you explain, the object field doesn't fix the problem we currently have, so I think it's out.
- Nested field is neat and could work.
- Join field is neat and could work.

If the only reason not to do join fields in judges is performance, then that's not a good enough reason. We only have a few thousand judges, they don't have much text, and performance will always be great. Hell, I could put all of their data into a txt file and grep it without any performance problems. It's just not much data. It'll be fast.

The other big advantage of join fields, that I'm surprised you didn't mention (maybe it's too obvious to bother saying), is that it'd be the same solution across RECAP, Judges, etc., making our lives less complicated. I think that's a really big deal.

So I vote for Join fields, but I do have two questions that I want to clarify.

First: Can join fields do multiple joins? For example, in our schema:

Docs FK to...
  Docket Entries, which FK to...
    Dockets, which  FK to...
      Courts.

Will join fields work for this or will we still have to play flattening games?

Number two: How's the memory of join fields vs our current solution? Any idea? I'm thinking about that poor server of ours and whether we can get better performance for RECAP just by using less RAM.

albertisfu commented 1 year ago

Thanks! Yeah, you're right: if we're going to use the Join field for RECAP and other document types, I agree we should also use it for Judges, so we have a homogeneous solution for nested documents across CourtListener. It also lets us detect any additional concerns or problems on a smaller scale, which will help when we work on RECAP.

About your questions:

> Can join fields do multiple joins?

Yes, they can, but it's not recommended to emulate the structure we have in the database, since each join level adds complexity and cost to queries. From the documentation:

> We don’t recommend using multiple levels of relations to replicate a relational model. Each level of relation adds overhead at query time in terms of memory and computation. For better search performance, denormalize your data instead.
>
> The join field shouldn’t be used like joins in a relation database. In Elasticsearch the key to good performance is to de-normalize your data into documents. Each join field, has_child or has_parent query adds a significant tax to your query performance. It can also trigger global ordinals to be built.
>
> The only case where the join field makes sense is if your data contains a one-to-many relationship where one entity significantly outnumbers the other entity. An example of such case is a use case with products and offers for these products. In the case that offers significantly outnumbers the number of products then it makes sense to model the product as parent document and the offer as child document.

So the better option is to keep only one level of child documents with a one-to-many relation, as in RECAP, where one parent docket has many child RECAP documents. We'll need to keep denormalizing things like the court and docket entry fields into the child documents and the parent document.
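A minimal sketch of that denormalization in Python, assuming a simplified version of the Court, Docket, Docket Entry, and Document chain. All field names, the `docket_child` join field, the relation names, and the ID scheme are hypothetical and only illustrate the shape of the data.

```python
def build_docket_family(court, docket, entries):
    """Return (parent, children) using one join level and denormalized fields."""
    parent_id = f"docket_{docket['id']}"
    parent = {
        "id": parent_id,
        "body": {
            "docket_number": docket["docket_number"],
            "court_id": court["id"],      # denormalized from the court
            "court_name": court["name"],  # denormalized from the court
            "docket_child": {"name": "docket"},
        },
    }
    children = []
    for entry in entries:
        for doc in entry["documents"]:
            children.append({
                "id": f"rd_{doc['id']}",
                "routing": parent_id,  # children live on the parent's shard
                "body": {
                    "description": doc["description"],
                    # Docket entry fields are copied onto each child instead
                    # of becoming a second join level.
                    "entry_number": entry["entry_number"],
                    "court_id": court["id"],
                    "docket_child": {"name": "recap_document", "parent": parent_id},
                },
            })
    return parent, children


court = {"id": "cand", "name": "N.D. California"}
docket = {"id": 1, "docket_number": "3:23-cv-00001"}
entries = [
    {"entry_number": 1, "documents": [{"id": 10, "description": "Complaint"}]},
    {"entry_number": 2, "documents": [
        {"id": 11, "description": "Motion"},
        {"id": 12, "description": "Exhibit A"},
    ]},
]

parent, children = build_docket_family(court, docket, entries)
print(len(children), children[1]["body"]["entry_number"])  # 3 2
```

The four relational levels collapse into two document types: the docket as parent and each document as a child that carries copies of the court and docket entry fields it needs for filtering.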

> Number two: How's the memory of join fields vs our current solution? Any idea? I'm thinking about that poor server of ours and whether we can get better performance for RECAP just by using less RAM.

There are no precise references regarding the memory impact of using a join field and join queries, apart from the recommendation to avoid using multiple levels of joins to prevent an increase in memory and computational resource usage.

I couldn't find issues related to the use of join fields and memory usage. On the other hand, there are many issues claiming memory problems when using aggregations in Elasticsearch. This would be equivalent to our current grouping solution in Solr for RECAP.

Since an aggregation has to hold results in memory in order to group them by the specific terms, it's a memory-intensive operation. In fact, I remember setting the size parameter on the aggregations for Parentheticals, since its default is a small value to prevent memory issues. (If Join fields work smoothly, perhaps we'll also need to update Parentheticals to use them instead of aggregations. Or, if there are not many parenthetical groups, the memory impact might not be significant.)
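For reference, a grouping request built on aggregations could look roughly like the sketch below. The `terms` aggregation returns 10 buckets by default unless `size` is set; `grouped_search_body`, the aggregation names, and the `group_id` field are hypothetical and not taken from the Parentheticals code.

```python
def grouped_search_body(field, groups=10, hits_per_group=3):
    """Build a request body that groups hits by `field` with a terms aggregation."""
    return {
        "size": 0,  # only the aggregation buckets are needed
        "aggs": {
            "groups": {
                # `size` caps how many buckets are built and returned.
                "terms": {"field": field, "size": groups},
                "aggs": {
                    # Keep the top few documents inside each bucket.
                    "top": {"top_hits": {"size": hits_per_group}}
                },
            }
        },
    }


body = grouped_search_body("group_id", groups=50)
print(body["aggs"]["groups"]["terms"])  # {'field': 'group_id', 'size': 50}
```

Raising `groups` increases the number of buckets, and the top hits kept in each one, that have to be held in memory, which is the cost described above.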

In summary, join queries seem to be more efficient than aggregations in terms of memory usage.

mlissner commented 1 year ago

Wonderful. Sounds like we know what to do!

mlissner commented 1 year ago

@albertisfu, is this one done?

albertisfu commented 1 year ago

Yeah, Judge Search is live in CL!

freelawproject / courtlistener

Move Judge Search to elasticsearch #2810