Expensify / App

Welcome to New Expensify: a complete re-imagination of financial collaboration, centered around chat. Help us build the next generation of Expensify by sharing feedback and contributing to the code.
https://new.expensify.com
MIT License
3.51k stars 2.86k forks source link

[HIGH] [API Reliability] Prevent chat rooms from being searchable via participant #42989

Closed mountiny closed 3 months ago

mountiny commented 5 months ago

Problem

Coming from here

The SearchForReports command is returning lots of data that we probably do not need to show on the Finder page; the OpenReport command would load anything else we need. This is a big payload that is unnecessarily large for the Finder page.

Coming from here we don't want chat rooms to show up in search if they have participant login/display names matching the search term. This has been called an "anti-feature" + it drastically slows down SearchForReports in a lot of cases when rooms with many participants are returned in the results.

Solution

Let's make sure we send the reports in order to display them on the finder page, and with only as much data, we really need to display them. cc @quinthar

Let's also stop querying for and returning room participant data + personal details.

melvin-bot[bot] commented 5 months ago

Triggered auto assignment to @anmurali (Bug), see https://stackoverflow.com/c/expensify/questions/14418 for more details. Please add this bug to a GH project, as outlined in the SO.

jasperhuangg commented 4 months ago

Note for myself, @marcaaron investigated and posted here that the slowest part of the command is actually the query, so looking into ways to reduce how much data the query returns will probably be the most productive.

Some other ways to speed up the command that were mentioned in that thread:

jasperhuangg commented 4 months ago

Took some time today breaking down the searchByKeywords query, trying to understand each piece, and theorizing a few ways to reduce what the query is doing. I also ran the chat report generator and generated 1000 reports in my local DB to use for testing.

Something interesting that I found was that this portion of the query contributes pretty significantly to the total query time, something like 30%.

(SELECT GROUP_CONCAT(displayName) FROM
    (
        SELECT JSON_EXTRACT(detailsNVP.value, '$.firstName') || JSON_EXTRACT(detailsNVP.value, '$.lastName') AS displayName FROM sharedReports s 
            INNER JOIN nameValuePairs detailsNVP ON detailsNVP.name = 'expensify_personalDetails' AND detailsNVP.accountID = s.accountID 
        WHERE s.reportID = r.reportID 
        UNION ALL
        SELECT JSON_EXTRACT(detailsNVP.value, '$.firstName') || JSON_EXTRACT(detailsNVP.value, '$.lastName') AS displayName  FROM nameValuePairs detailsNVP 
        WHERE detailsNVP.name = 'expensify_personalDetails' 
            AND detailsNVP.accountID = r.accountID 
            AND r.accountID != 0 
    )
) AS participantDisplayNames 
marcaaron commented 4 months ago

I also think something like a getCriticalReportData() that is used in some places instead of getChats() might be helpful?

We can time how long the Report::getChats() part of this command takes by using timestamps and logging the result. I was chatting with @neil-marcellini about this earlier today - but in the context of OpenApp.

this portion of the query contributes pretty significantly to the total query time

Yes, some reports can have a lot of participants in sharedReports so getting their details could be expensive.

marcaaron commented 4 months ago

looking into ways to reduce how much data the query returns

The query itself returns no data - only reportIDs. Report::getChats() is run after to get the data so that's what this ticket is about reducing (not trying to discourage you from optimizing the query though - anything we can learn there would be great).

marcaaron commented 4 months ago

Only returning the actual reports we are going to show the user

I'm not sure about this. Haven't observed it myself. If this is happening it means that the frontend is filtering out results that the backend assumes it should be showing. No idea where the difference is with that.

marcaaron commented 4 months ago

We could conceivably have the client pass the reportID at the “bottom” of the current search result to SearchForReports. The order of the reports would need to be stable between the client and server first (not certain if this is the case today). We should be able to re-design it to work this way.

This would be a larger project I think. It is basically suggesting we implement pagination for LHN searching.

jasperhuangg commented 4 months ago

Got it, I'll look into ways to optimize the getChats query specifically for SearchForReports next, thanks for your insight @marcaaron!

jasperhuangg commented 4 months ago

The query itself returns no data - only reportIDs. Report::getChats() is run after to get the data so that's what this ticket is about reducing (not trying to discourage you from optimizing the query though - anything we can learn there would be great).

Hmm based on some investigating the logs of some slow requests I feel like there are some cases where the query to search by keywords is way too slow.

At most, the query in getChats will be passed 100 reportIDs, so I feel like it won't get that slow.

Here's a request where the searchForKeywords query took 103 seconds: https://www.expensify.com/_devportal/tools/logSearch/?#query=request_id:(%22890147f41fedbc5d-ZRH%22)+AND+timestamp:[2024-06-07T13:22:05.076Z+TO+2024-06-07T15:22:05.076Z]&index=logs_expensify-028822

jasperhuangg commented 4 months ago

Anyways I'll redirect my attention at reducing the data sent back first, seems that's the lower hanging fruit here, although the slow queries are definitely alarming.

jasperhuangg commented 4 months ago

Not overdue, still investigating a solution to this.

jasperhuangg commented 4 months ago

Have a draft PR up. Does some things that help to drastically reduce the number of lines returned in SearchForReports for each report:

This cuts down the number of keys we need to send per report summary by over half.

jasperhuangg commented 4 months ago

^ After some more investigation I found that my approach was wrong.

The real issue with what we're doing in Onyx is we're treating all report types the same. There are many fields (like the "total" or "currency" fields, for example) that don't apply to chat reports at all, but we still send over a field like total: 0, even though it has nothing to do with the report.

I'm going to go through each report type, find the minimal amount of data needed to display it as a search option, and ensure we only send over that data for that specific report type.

jasperhuangg commented 4 months ago

Posted some discussion here

jasperhuangg commented 4 months ago

After analyzing this response:

So my next steps are to:

jasperhuangg commented 4 months ago

Had some great discussion here and we've landed on yet another new approach:

This is not too far off to what I've done currently but I will need to make further modifications to my PRs, specifically modifying the getChats query. I've stuck both PRs that are under review back onto draft mode while I iron this out.

jasperhuangg commented 4 months ago

More progress with the PRs today! Should be close, just need another review

jasperhuangg commented 3 months ago

Just merged the Auth PR

jasperhuangg commented 3 months ago

Auth PR was deployed, this is done!