erigontech / erigon

Ethereum implementation on the efficiency frontier https://erigon.gitbook.io
GNU Lesser General Public License v3.0

Implement txpool_pendingTransactions (similar to parity_pendingTransactions) #2917

Closed AskAlexSharov closed 1 year ago

AskAlexSharov commented 2 years ago

https://openethereum.github.io/JSONRPC-parity-module#parity_pendingtransactions

This will probably need to be done in several PRs: first, add a similar method to txpool's gRPC interface. Also, our TxPool has 3 sub-pools: pending, baseFee, queued (parity and geth have only pending and queued). This fact must be reflected in the API as well.

We have txpool_content - but it doesn't have pagination (and we can't break compatibility here) - and transferring 300K transactions in 1 message is almost 1GB (roughly 3KB per transaction). We have to keep many limits (RPC and gRPC) very high because of this method.

shekhirin commented 2 years ago

Hi! I'd like to work on this one, and to start I need to clarify one thing regarding OpenEthereum's implementation. Currently parity_pendingTransactions doesn't support proper pagination (only a limit parameter) but does support filtering through a filter parameter. Do we need both of these things (pagination and filtering) on Erigon's side?

Also, I think it might be a bit confusing from the user's perspective to have this method named pendingTransactions, since it collides with the pending sub-pool name but actually returns various types of transactions from the pool. It also feels strange to name it pending, because the txpool already holds only pending (in the sense of awaiting inclusion in a block) transactions. What do you think about something generic like txpool_transactions?

AskAlexSharov commented 2 years ago
  1. We can start with a limited number of features, but having pagination and at least 1 filter parameter is essential - because we already have the txpool_content method, which returns all txs without pagination and without filters.
  2. "method named pendingTransactions since it collides with pending sub-pool name but actually returns various types of transactions from the pool" - agree. We don't have to keep compatibility with OE in this method. It's fine to change it - the name or the request style.
  3. I advise you to investigate existing best practices (in the world) of gRPC pagination and filter implementation (likely we will have a limited number of filters hard-coded in the .proto file). For example: https://cloud.google.com/apis/design/design_patterns - design this feature as a gRPC-first API, and then add the JSON-RPC API "somehow" on top of the "good gRPC API".
  4. The pool includes "pending" txs - which are already ready to be included in a block - and a couple of other types of txs which are not ready to be included in a block (for example, the user doesn't have enough balance now, but that can change in the near future; or a txn has too high a nonce and can't be included in a block yet, but maybe the txs with lower nonces just haven't reached you from the P2P network and may arrive soon). Pool design: https://github.com/ledgerwatch/erigon/wiki/Transaction-Pool-Design , pool docs: https://github.com/ledgerwatch/erigon/tree/devel/cmd/txpool
  5. txpool_transactions or txpool_search - either is fine with me.
  6. I just looked inside go-ethereum: they have a new method, txpool_contentFrom - a single filter by sender - https://github.com/ethereum/go-ethereum/blob/master/internal/ethapi/api.go#L179 - maybe as the first step we can just support this method (see the sketch below). It already covers 80% of use-cases. And this method can't return a large response, because there is a limit on how many txs from the same sender can be in the pool - AccountSlots=16 - so there is no reason for pagination there. Yes, as the first step let's add txpool_contentFrom - and while we work on this method, think about what user-friendly and gRPC-friendly API we can provide to cover the last 20% of use-cases.
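
For reference, a minimal sketch of how a client might call the geth-style txpool_contentFrom over JSON-RPC, assuming a go-ethereum rpc.Client and an illustrative endpoint URL; the reply shape is simplified here, not geth's exact RPCTransaction structure:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	client, err := rpc.Dial("http://localhost:8545") // illustrative endpoint
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	// txpool_contentFrom returns the pending and queued txs of one sender,
	// keyed by nonce - bounded by the per-sender slot limit (AccountSlots=16).
	var content map[string]map[string]interface{}
	sender := common.HexToAddress("0x0000000000000000000000000000000000000000")
	err = client.CallContext(context.Background(), &content, "txpool_contentFrom", sender)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("pending:", len(content["pending"]), "queued:", len(content["queued"]))
}
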
shekhirin commented 2 years ago

Sounds good. Pagination isn't trivial though: we have transactions stored in a BTree sorted by (sender, nonce), so it would be convenient to use it for pagination and set next_page_token (as recommended in the Google doc) to a transaction id. However, that breaks if the transaction is evicted from the pool, or if new transactions are inserted before the next_page_token position while the user is paging.
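
One way around the eviction problem is to key the cursor by (sender, nonce) itself rather than by a transaction id: if the cursor's tx is gone, the scan simply resumes at the next key. A minimal sketch, assuming the google/btree library and illustrative item types (not the pool's real ones):

package txpage

import "github.com/google/btree"

type txItem struct {
	sender [20]byte
	nonce  uint64
}

// Less orders items by (sender, nonce), mirroring the pool's BTree order.
func (a *txItem) Less(b btree.Item) bool {
	o := b.(*txItem)
	if a.sender != o.sender {
		return string(a.sender[:]) < string(o.sender[:])
	}
	return a.nonce < o.nonce
}

// page returns up to limit items strictly after cursor, plus the cursor
// for the next call (nil once the scan reached the end of the tree).
func page(t *btree.BTree, cursor *txItem, limit int) ([]*txItem, *txItem) {
	var out []*txItem
	pivot := cursor
	if pivot == nil {
		pivot = &txItem{} // zero key: start from the very beginning
	}
	t.AscendGreaterOrEqual(pivot, func(i btree.Item) bool {
		it := i.(*txItem)
		if cursor != nil && *it == *cursor {
			return true // skip the cursor element itself, if still present
		}
		out = append(out, it)
		return len(out) < limit
	})
	if len(out) < limit {
		return out, nil
	}
	return out, out[len(out)-1]
}

Because the cursor is a key rather than an id, an evicted transaction doesn't invalidate it, and transactions inserted behind the cursor are simply skipped on later pages.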

Regarding txpool_contentFrom: it seems we won't need pagination for it, so something like

message SearchRequest {
  message Filter {
    optional Tx.Type type = 1; // filter by transaction type
    optional bytes from = 2;   // filter by sender address
  }
  Filter filter = 1;
}
message SearchReply { repeated Tx txs = 1; }

to start with, and then extending it with pagination and more filters later for txpool_transactions, might work fine.

mtgnoah commented 2 years ago

Was this feature ever finished?

AskAlexSharov commented 2 years ago

@mtgnoah no, last PR was abandoned

SozinM commented 1 year ago

Hi! I want to take a shot at this issue. I've already implemented the search function with a "from" filter. Right now I want to pitch my idea for the pagination architecture and get some feedback. There are 2 big classes: offset-based vs cursor-based. Offset-based won't work well with dynamic data, so I lean toward cursor-based.

My proposal is to return a page token representing the (address, nonce) of the last transaction we returned. This token should either be an encrypted/signed, base64-encoded (address, nonce), or a key (uuid) into a server-side map holding the (address, nonce) (and maybe the filter).
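
A minimal sketch of the signed variant, assuming HMAC-SHA256 over the packed (address, nonce) cursor (which also lines up with the sha256 point raised below); the key handling is illustrative only:

package pagetoken

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/binary"
	"errors"
)

var tokenKey = []byte("server-side-secret") // illustrative; should be random and per-process

// encodeToken packs (address, nonce) and appends an HMAC so clients
// cannot tamper with or forge the cursor.
func encodeToken(addr [20]byte, nonce uint64) string {
	buf := make([]byte, 28)
	copy(buf, addr[:])
	binary.BigEndian.PutUint64(buf[20:], nonce)
	mac := hmac.New(sha256.New, tokenKey)
	mac.Write(buf)
	return base64.URLEncoding.EncodeToString(append(buf, mac.Sum(nil)...))
}

func decodeToken(tok string) (addr [20]byte, nonce uint64, err error) {
	raw, err := base64.URLEncoding.DecodeString(tok)
	if err != nil || len(raw) != 28+sha256.Size {
		return addr, 0, errors.New("malformed page token")
	}
	mac := hmac.New(sha256.New, tokenKey)
	mac.Write(raw[:28])
	if !hmac.Equal(mac.Sum(nil), raw[28:]) {
		return addr, 0, errors.New("invalid page token signature")
	}
	copy(addr[:], raw[:20])
	return addr, binary.BigEndian.Uint64(raw[20:28]), nil
}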

It's better to keep only a one-way cursor to simplify the implementation; since we only want to return data in chunks, a two-way cursor would be excessive.

There are 2 ways to provide a filter:

  1. With each request. That will require us to validate that the filter corresponds to the provided page token.
  2. With the initial request only. We keep the filter in some map, and follow-up requests contain only the page_token, which is used to retrieve the filter data. (I prefer this approach)

Also crucial to note: after a request sequence has started, data changes behind the cursor are invisible to that sequence, while data changes ahead of the cursor are included.

AskAlexSharov commented 1 year ago

@SozinM thank you. I don't understand the "final note". Do you mean: data inside the txpool may change between requests and it's fine, we ignore it, and the user may see the same tx on different pages because its position changed inside the txpool between requests?

my thinking:

  1. There is a hard limit on the number of txs per sender: 16. Not so big. There's not much reason to paginate within 1 sender, so there's no reason to accept a "nonce" in the request.
  2. What is bad about Parity's interface? What cons do you see?
  3. The TxPool's main feature is a complex sort of txs, so a miner can take the top N txs and be sure the miner's income is maximized. The TxPool has to take into account tons of parameters: "is there a gap in the sender's nonces?", "does the sender have enough balance for all txs?", ... And my main concern is: the API returns txs in some other order - by address, by nonce. Why? Why would we drop the TxPool's main feature?
  4. TxPool now uses senderID everywhere (instead of address); it's fine, we can just use a sorted map for the addr=>id mapping: https://github.com/ledgerwatch/erigon-lib/blob/d2f5682ee80aba9ff3e151e0dc25a53c0c65e691/txpool/pool.go#L275
  5. If we use a token-based filter, we will use sha256 to prevent other users from guessing/stealing your token.
  6. Let's formulate: what user use-case are we trying to serve with this API? @SozinM share your use-case please.
  7. More on use-cases. The TxPool has 3 sub-pools, and one of the user use-cases is "Why is my tx not in the Pending sub-pool?" "How can I fix it?"
  8. If the user doesn't provide a "sender" in the request, then the TxPool will need to do a full scan over 100K txs. It's unclear which secondary index may be useful here - there are 6 filter fields in parity.
SozinM commented 1 year ago

I think you got me a bit wrong here. I plan to implement the rest of the Parity filters, and all I wanted was to validate my design for the pagination.

I don't understand the "final note". Do you mean: data inside the txpool may change between requests and it's fine, we ignore it, and the user may see the same tx on different pages because its position changed inside the txpool between requests?

There would be no pages, only a cursor. If we return the first 100 transactions that fit the filter, and the user adds more transactions that land in the txpool behind the current cursor, the user will not see these transactions in this request context. The user needs to start a new request in order to see these newly added transactions.

  1. Yes, but I will add additional filters, and thus I will need pagination. It would also be implemented for the "from" filter, just so the code is unified.
  2. I don't see any as of right now; I plan to implement it fully for this call.
  3. Are you proposing to make the return order depend on the filter? Because as I understand from the txpool doc, all transactions in the txpool (as of right now) are sorted by (address, nonce), and I was not planning to change the order in any specific manner in this function - only apply the filter and return them in the order they are in the txpool (specifically in the txPool.deprecatedForEach function).
  4. I could use senderId instead of an address, no problem here.
  5. It's not to prevent others from seeing it, but to prevent tampering with it, as required in https://cloud.google.com/apis/design/design_patterns
  6. One of the use-cases I could imagine: find transactions greater than some value - for MEV searchers.
  7. Didn't quite get your point here.
  8. For the "nonce" and "from" fields we could use the BySenderAndNonce BTree structure in the txpool. I don't see any efficient structure to use with the other fields right now. We could try to calculate the sub-pool a tx would land in based on the gas filters and the current baseFee value (or at least I think we can; see the sketch after this list).
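
For point 8, a hedged sketch of that calculation, assuming the rules from the pool-design wiki (pending requires no nonce gap, sufficient balance, and feeCap >= the current baseFee); the parameters are illustrative, not the pool's real fields:

package txpool

// subPoolFor guesses which sub-pool a tx lands in under the assumptions above.
func subPoolFor(feeCap, baseFee uint64, nonceGap, enoughBalance bool) string {
	switch {
	case nonceGap || !enoughBalance:
		return "queued" // not executable yet, regardless of fees
	case feeCap < baseFee:
		return "baseFee" // executable, but priced out at the current base fee
	default:
		return "pending" // ready for inclusion in a block
	}
}
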
elee1766 commented 1 year ago

tl;dr:

Personally, I'm having trouble seeing why someone would want a paginated, filtered subset of the txpool. To me, if someone is doing bulk retrieval from the txpool, they probably want it all (no filter), and if they aren't, they probably don't need more than some sane number of entries (otterscan).

In my opinion, the most useful pagination method would be paginating by offset in order of entry. This would make the task of keeping track of txns from an external application very easy - which I think should be the goal of erigon.

Why: because sophisticated MEV players will have their own filters and search methods - so the priority should be an easy way to extract the data, rather than providing an interface they won't use. Let them build their own complicated index - I don't think that is erigon's job.

thought process

I see the following as possible use cases for such a method, in order of importance. If I am missing anything, let me know:

  1. Scanning for possible replay attacks or front-running opportunities (mev)
  2. Getting the most profitable txns + miner bribe (block building)
  3. Indexing the entire mempool (network statistics)

Now walking through each of these use cases individually:

mev

A sophisticated player worth their salt will keep their own index of all pending txns in a local database.

One is going to "backfill" their database and then subscribe to notifications for pending transactions, and also probably poll just in case.

In the event that they lose their socket connection, they would want to be able to "catch up" by asking "what transactions have been broadcast in the last …". This is because one can use that result set to check "have I received these transactions before?" and, if they have not, to increase their interval.

block building

I have not built a block since Homestead, so forgive me if my information is outdated.

AFAIK, builders care about the highest gas price they can fit into their block, and then include extra transactions based on orders from flashbots or whatever bribe network they use.

network statistics

Tracking the mempool state, like etherscan does. In this case, missing a few pending transactions likely isn't a big deal. In reality this is very similar to the MEV use case - except missing txns is fine, so one can probably just backfill+subscribe and not worry about reconnects. On the other hand, something like otterscan could possibly want to display "pending txns" for a specific address - perhaps search is useful there.

also

The RPCDaemon already supports subscribing to pending transactions: https://github.com/ledgerwatch/erigon/blob/devel/turbo/rpchelper/filters.go#L37

These are stateful and record all the transactions since you last polled the filter. I'm not a big fan of stateful endpoints, but I thought I would mention that this exists as an option to consider.
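
For completeness, a minimal sketch of that stateful flow from the client's side, using the standard eth_newPendingTransactionFilter / eth_getFilterChanges pair; the endpoint URL is illustrative:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	client, err := rpc.Dial("http://localhost:8545") // illustrative rpcdaemon endpoint
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	ctx := context.Background()
	// The node remembers the filter's read position between polls - this is
	// the "stateful" part.
	var filterID string
	if err := client.CallContext(ctx, &filterID, "eth_newPendingTransactionFilter"); err != nil {
		log.Fatal(err)
	}
	for range time.Tick(2 * time.Second) {
		var hashes []string // hashes of pending txs seen since the previous poll
		if err := client.CallContext(ctx, &hashes, "eth_getFilterChanges", filterID); err != nil {
			log.Fatal(err)
		}
		fmt.Println("new pending txs:", len(hashes))
	}
}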

AskAlexSharov commented 1 year ago

FYI: txpool has GRPC interface: https://github.com/ledgerwatch/interfaces/blob/master/txpool/txpool.proto

AskAlexSharov commented 1 year ago
  1. The TxPool has 3 sub-pools: https://github.com/ledgerwatch/erigon/wiki/Transaction-Pool-Design Only txs from the "pending" sub-pool are ready for mining.
elee1766 commented 1 year ago

FYI: txpool has GRPC interface: https://github.com/ledgerwatch/interfaces/blob/master/txpool/txpool.proto

Yeah, I think the gRPC interface should be pushed to those looking to do large-scale txpool work.

Possibly, people are scared because gRPC is r/w.

Is it easy to subset gRPC endpoints into ro/rw methods? Exposing a read-only endpoint might be very useful.

SozinM commented 1 year ago

Another use case I see from an infra-provider point of view: the current implementation of txpool_content is very unfriendly to reverse proxies, because we need to crank up max_body_size and timeout limits. It would be great to have an alternative to this call that fits within default limits (for example, default Nginx limits). Also, it's crucial to have filters so our clients don't abuse our egress and get only the data they need.

elee1766 commented 1 year ago

Another use case I see from an infra-provider point of view: the current implementation of txpool_content is very unfriendly to reverse proxies, because we need to crank up max_body_size and timeout limits. It would be great to have an alternative to this call that fits within default limits (for example, default Nginx limits). Also, it's crucial to have filters so our clients don't abuse our egress and get only the data they need.

Not sure I agree here.

Why does the customer want this service in the first place? If customers are simply using txpool_content to create their own local databases, perhaps the service should be providing a compressed txpool dump for that use case.

If a service provider wants to give its users a new method to view txns, I think that should be implemented at the service-provider level. Needing to support some specific endpoint for service providers seems like a nightmare that is out of scope for erigon (what happens when we need to remove the method to accommodate a major architecture change?).

Also, I think it is up to the service provider to sanitize input to their nodes so that customers do not abuse their egress. I think "allowing service providers to blindly forward requests through nginx" is something erigon should not encourage.

Even small infra services should, at the very least, be validating the JSON-RPC requests themselves before forwarding them to erigon, and ultimately we should be using a combination of gRPC and JSON-RPC to fully leverage the available performance.

AskAlexSharov commented 1 year ago

FYI: txpool has GRPC interface: https://github.com/ledgerwatch/interfaces/blob/master/txpool/txpool.proto

Yeah, I think the gRPC interface should be pushed to those looking to do large-scale txpool work.

Possibly, people are scared because gRPC is r/w.

Is it easy to subset gRPC endpoints into ro/rw methods? Exposing a read-only endpoint might be very useful.

All doable. But we never had this as a goal. Maybe we need to consider it.

SozinM commented 1 year ago

@elee1766

if a service provider wants to give its users a new method to view txns - i think that should be implemented at the service provider level.

It was not my point. My understanding is that providing more flexible ways to access data will improve erigon adoption. As for the method, I think it makes sense to implement txpool_contentFrom with the "from" param and without pagination. 2 use cases were highlighted in the comments:

  1. Analysing problems with txns. But it's limited to checking which sub-pool your transactions end up in.
  2. Using this method in otterscan to show the transactions of an address.

Also, I see a beneficial point in that it matches the method implemented in go-ethereum, which could help onboard those who are switching from the geth client and are already using this txpool method.

elee1766 commented 1 year ago

It was not my point. My understanding is that providing more flexible ways to access data will improve erigon adoption. As for the method, I think it makes sense to implement txpool_contentFrom with the "from" param and without pagination.

I see what you mean now - I misunderstood. That makes sense to me.

SozinM commented 1 year ago

@AskAlexSharov WDYT about this?

AskAlexSharov commented 1 year ago

My opinion: let's do something (at least start work on it) that Otterscan can use to beat Etherscan in UI. A simple txpool_contentFrom is clearly not enough. parity_pendingtransactions is not enough either (no ordering, etc.), but it's clearly much more extensible (we can add new filters and orderings in the future without breaking compatibility).

Technical details:

  1. If for some filters we have no index now, it's ok to manually filter on the server side, even if it's low-performance.
  2. It's ok to implement some subset in 1 PR. It's ok to create many preparatory PRs. I will try to merge fast.
  3. About "why is my tx not in the Pending sub-pool": that should probably be another API call which returns deep info about a given tx (parity_pendingTransactionsStats?). But returning at least 1 column like "current sub-pool" (enum: Pending|Queued|BaseFee) is probably a must-have now. So, I think the internal API must return all txs (not only pending) and apply the usual filters.
  4. The gRPC API can return RLP-encoded txns (in the same format we send them to the network and store in the txpool's db). All additional fields can be hardcoded in txpool.proto. A rough sketch of the resulting request/reply shape follows this list.
  5. I recently implemented something similar, but simpler: Range(Table, fromPrefix, toPrefix, orderAscend, limit)
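
As mentioned in point 4, here is a rough Go sketch of the request/reply shape these points add up to - written as plain structs rather than .proto for brevity; every name is illustrative, not a committed interface:

package txpoolapi

// SubPool mirrors the three sub-pools; only Pending txs are minable.
type SubPool int

const (
	Pending SubPool = iota
	BaseFee
	Queued
)

type TxFilter struct {
	Sender  *[20]byte // nil = no sender filter (server falls back to a full scan)
	SubPool *SubPool  // nil = txs from all sub-pools (point 3)
}

type SearchRequest struct {
	Filter    TxFilter
	PageSize  int
	PageToken string // opaque signed cursor; empty requests the first page
}

type SearchReply struct {
	RLPTxs        [][]byte  // txs in the wire/db RLP encoding (point 4)
	SubPools      []SubPool // current sub-pool of each returned tx (point 3)
	NextPageToken string    // empty once the scan is complete
}
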
AskAlexSharov commented 1 year ago

@SozinM FYI: here is an example implementation of similar filters in .proto: FilterTree https://github.com/dgraph-io/dgraph/blob/main/protos/pb.proto#L67

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 40 days with no activity. Remove stale label or comment, or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 7 days with no activity.