Builds on top of https://github.com/matrix-org/matrix-content-scanner-python/pull/17 to cache scan results and file contents in a time-based LRU cache, so we don't spend time re-fetching media from the homeserver.
To keep large files from growing the cache beyond reason, I've also added a size limit above which a file's contents aren't cached (only its scan result is). If a file over this limit is requested again, we just download it again, without writing it to disk or scanning it again.
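
To illustrate the behaviour described above, here is a minimal standard-library sketch. It is not the code in this PR: `ScanCache`, `CacheEntry`, `fetch_scanned`, and the `max_entries`, `ttl` and `max_cached_size` parameters are hypothetical names chosen for the example, and `download` and `scan` stand in for whatever the scanner actually uses to fetch and check a file.

```python
import time
from collections import OrderedDict
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CacheEntry:
    clean: bool               # cached scan result
    content: Optional[bytes]  # None when the file was over the size limit
    expires_at: float         # time after which the entry is stale


class ScanCache:
    """Hypothetical time-based LRU cache for scan results and file contents."""

    def __init__(
        self,
        max_entries: int,
        ttl: float,
        max_cached_size: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_entries = max_entries
        self._ttl = ttl
        self._max_cached_size = max_cached_size
        self._clock = clock
        self._entries: "OrderedDict[str, CacheEntry]" = OrderedDict()

    def get(self, media_path: str) -> Optional[CacheEntry]:
        entry = self._entries.get(media_path)
        if entry is None:
            return None
        if entry.expires_at <= self._clock():
            # Entry is too old: drop it and report a miss.
            del self._entries[media_path]
            return None
        # Mark the entry as most recently used.
        self._entries.move_to_end(media_path)
        return entry

    def set(self, media_path: str, clean: bool, content: bytes) -> None:
        # Always keep the result; keep the content only if it is small enough.
        cached = content if len(content) <= self._max_cached_size else None
        self._entries[media_path] = CacheEntry(
            clean, cached, self._clock() + self._ttl
        )
        self._entries.move_to_end(media_path)
        # Evict least recently used entries once the cache is full.
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)


def fetch_scanned(
    media_path: str,
    cache: ScanCache,
    download: Callable[[str], bytes],
    scan: Callable[[bytes], bool],
) -> Tuple[bool, bytes]:
    entry = cache.get(media_path)
    if entry is None:
        # Cache miss: download, scan, and remember the outcome.
        content = download(media_path)
        clean = scan(content)
        cache.set(media_path, clean, content)
        return clean, content
    if entry.content is not None:
        # Result and content are both cached: no download, no scan.
        return entry.clean, entry.content
    # Over the size limit: download again, but reuse the cached result.
    return entry.clean, download(media_path)
```

In this sketch, a repeated request for a file over `max_cached_size` costs one extra download but no extra scan, which matches the trade-off described above.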