HobnobMancer / cazy_webscraper

Web scraper to retrieve protein data catalogued by the CAZy, UniProt, NCBI, GTDB and PDB websites/databases.
https://hobnobmancer.github.io/cazy_webscraper/
MIT License

Potentially streamlined scraping #35

Closed · HobnobMancer closed this 3 years ago

HobnobMancer commented 3 years ago

@widdowquinn I think I've thought of another way to increase the rate of scraping CAZy.

At the moment, when parsing a protein (our current working protein), the flow is as follows (sketched in code after the list):

  1. The scraper checks if the protein is already present in the local database by querying by the primary GenBank accession.
  2. If the current working protein has already been stored in the local database (identified by its primary GenBank accession), the scraper checks that all UniProt, PDB and non-primary GenBank accessions listed for the current working protein are associated with it in the local database.
  3. If any of the listed UniProt, PDB and non-primary GenBank accessions are not associated with the current working protein then they are added to the local database.
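
A runnable sketch of that flow, using an in-memory dict as a stand-in for the local database (the field names and structure are illustrative, not the scraper's real schema):

```python
# Sketch of the current per-protein logic; an in-memory dict stands in
# for the local database, so this is illustrative rather than the
# scraper's actual implementation.
local_db = {}  # primary GenBank accession -> set of associated accessions

def parse_protein(primary_genbank, associated):
    """associated: UniProt, PDB and non-primary GenBank accessions."""
    record = local_db.get(primary_genbank)           # 1. query by primary accession
    if record is None:
        local_db[primary_genbank] = set(associated)  # new protein: store everything
        return
    for accession in associated:                     # 2./3. add any missing accessions
        if accession not in record:
            record.add(accession)

parse_protein("WP_000001.1", ["A0A000", "1ABC"])
parse_protein("WP_000001.1", ["A0A000", "1ABC", "2DEF"])  # only 2DEF is new
```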

Could we assume that for every family a protein appears in, its associated data (UniProt, PDB and non-primary GenBank accessions, EC numbers and source organism) are the same? I.e. the data (UniProt, PDB and non-primary GenBank accessions, EC numbers and source organism) presented in the row of the HTML table for a given protein will be the same in every CAZy family HTML table the protein appears in. By applying this assumption, the number of queries to the local CAZy database can be significantly reduced: the scraper would only need to check whether the current working protein is associated with the current working CAZy family.
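
Under that assumption, a hypothetical streamlined version of the sketch above would skip the per-accession checks entirely and only record the protein-family association:

```python
# Streamlined variant of the sketch above (reusing local_db): after the
# first sighting of a protein, its associated accessions are assumed
# identical in every family table, so only the family link is recorded.
family_links = set()  # (primary GenBank accession, CAZy family) pairs

def parse_protein_streamlined(primary_genbank, family, associated):
    if primary_genbank not in local_db:           # first sighting: store everything
        local_db[primary_genbank] = set(associated)
    family_links.add((primary_genbank, family))   # the only per-row check needed
```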

An alternative is to add a streamline-scraping mode that is enabled at the command line and applies this assumption. When invoking this mode, the user would be warned at the start that it may not retrieve all UniProt, PDB and non-primary GenBank accessions etc., in case of potential inconsistencies in the CAZy dataset or previous errors when retrieving data for the current working protein.

An advancement on the 'streamline-scraping' mode would be to let users customise to what extent the streamlining is applied. The user could specify the criteria against which the streamlining is applied; for example, --streamline uniprot,pdb,ec would apply the assumption to UniProt accessions, PDB accessions and EC numbers but not to source organisms or GenBank accessions. To make life easier I could add a full option that would apply the streamlining to UniProt, PDB and GenBank accessions, EC numbers and source organisms, saving the user writing out all 5.
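
To make that concrete, a hypothetical argparse sketch of the proposed option (the flag name and criteria come from the proposal above; nothing here is an existing interface):

```python
import argparse

# Hypothetical parser for the proposed --streamline option.
STREAMLINE_CHOICES = {"genbank", "ec", "organism", "pdb", "uniprot"}

parser = argparse.ArgumentParser()
parser.add_argument(
    "--streamline",
    type=lambda s: set(s.lower().split(",")),
    default=None,
    help="comma-separated criteria to streamline (e.g. uniprot,pdb,ec) or 'full'",
)

args = parser.parse_args(["--streamline", "uniprot,pdb,ec"])

# 'full' expands to all five criteria, saving the user writing them out
criteria = STREAMLINE_CHOICES if args.streamline == {"full"} else args.streamline
print(criteria)  # e.g. {'uniprot', 'pdb', 'ec'}
```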

widdowquinn commented 3 years ago

> Could we assume that for every family a protein appears in, its associated data (UniProt, PDB and non-primary GenBank accessions, EC numbers and source organism) are the same?

I can't think of a reason why there should be any difference, in the sense of "the GenBank accession should be associated with the same (for example) UniProt accession."

Where there is a difference, I'd expect it to be because one or more (say) UniProt accessions are not present in the row at the CAZy database; I'm not convinced that the CAZy.org schema has entirely avoided this sort of integrity problem - but if you're confident…

> By applying this assumption, the number of queries to the local CAZy database can be significantly reduced: the scraper would only need to check whether the current working protein is associated with the current working CAZy family.

If you're not wanting to double-check the associated accessions, that could save some time. But I wonder what proportion of proteins this would be, and if the trade-off of - say - avoiding 1% of checks is worth it at the expense of a possible loss of data integrity, if the CAZy.org tables are inconsistent?

Overall I suspect the web/network IO is the slow step, and in-memory database queries are fast by comparison, so the speed-up is possibly not so great. Do you have any numbers to show how impactful this change might be?

HobnobMancer commented 3 years ago

> I can't think of a reason why there should be any difference, in the sense of "the GenBank accession should be associated with the same (for example) UniProt accession."
>
> Where there is a difference, I'd expect it to be because one or more (say) UniProt accessions are not present in the row at the CAZy database; I'm not convinced that the CAZy.org schema has entirely avoided this sort of integrity problem - but if you're confident…

I agree that it's not best practice for data integrity. But I haven't been able to find an instance in CAZy that breaks the rule. In theory there shouldn't be one, as the PDB, UniProt and GenBank accessions and EC numbers should all be associated with one another, or at least with a single protein record, which should then be presented in exactly the same way each time it appears in a CAZy protein table. But that relies on a lot of assumptions about the CAZy schema.

> If you're not wanting to double-check the associated accessions, that could save some time. But I wonder what proportion of proteins this would be, and if the trade-off of - say - avoiding 1% of checks is worth it at the expense of a possible loss of data integrity, if the CAZy.org tables are inconsistent?

I added in this assumption so the user can 'customise' it: the user can define which combination of UniProt, GenBank and PDB accessions and EC numbers to presume are the same each time a given protein appears in CAZy. I set my run to presume UniProt and PDB accessions and EC numbers are the same, but to always check that all GenBank accessions are retrieved for each instance a given protein appears in CAZy, as it's the GenBank accessions that are so important for pyrewton.
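
With the proposed flag, that run would look something like `cazy_webscraper --streamline uniprot,pdb,ec` (illustrative syntax; the exact interface is still to be settled), leaving GenBank accessions outside the assumption so they are always re-checked.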

The rate of scraping varies from family to family. The time required to parse a protein entry from CAZy is shortest when adding a new protein and longest when adding data to existing proteins, so families with a higher ratio of previously scraped to newly scraped proteins scrape at a significantly slower rate. This makes it extremely difficult to accurately predict the total time to scrape CAZy. But approximately: without the assumption that UniProt and PDB accessions and EC numbers are the same each time a given protein is scraped from CAZy, it was looking like the scraper would take 7-8 days (at least!) to scrape the entirety of CAZy. When applying these assumptions (and only double-checking the GenBank accessions) I'm looking at under 4 days.

I think it's a behaviour worth keeping in the scraper, while strongly highlighting in the documentation that it isn't best practice for the integrity of the data included in the assumption. It should only be used when performing an extremely large scrape, such as the whole database, multiple CAZy classes, or a total scraping time of over a week, and the assumption should be applied minimally (e.g. only applied to data that is not absolutely essential to, or utilised in, downstream processing).

The advantage for extremely large scrapes (e.g. the whole of CAZy) is that it significantly reduces the number of proteins that take one or more seconds to scrape/parse. There are ~2,000,000 protein entries in CAZy to be parsed; if half of those take 1 second each, that's already over 11 days (1,000,000 s ÷ 86,400 s/day ≈ 11.6 days) to scrape those 1,000,000 entries.


At the time, without applying the assumption, I had only scraped GH, which is half the database, and it had taken over 3.5 days; the average parse time was also increasing with each CAZy family. I would expect the second half to potentially be slower because of an increase in the ratio of previously scraped to new proteins, as this is the consistent behaviour I've found. At an average of 0.3 s per CAZy entry the scrape will take about 7 days; anything slower than that and it's over a week to scrape CAZy.

I also had to restart the scrape because the behaviour when reattempting to connect to CAZy after a connection timed out wasn't correct: I wasn't convinced it was parsing the page after the connection had previously timed out, which would have led to a partial scrape of CAZy. Now it's all sorted, but I don't want it to take a week or more to do.

The scraper does allow the user to add data to an existing database. Therefore, you can apply the assumption when performing the large data scrape, then, if there are specific subsets for which you want to ensure you have, for instance, all the PDB accessions, you can re-scrape just those subsets, adding the data to the database created by the large CAZy scrape. With the logging system in the database, if you share the database with someone they can see exactly what you did: whether or not you applied the assumption, and how it was applied.
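
For example (illustrative flags, standing in for whatever the final interface exposes): a first pass with `--streamline full` to cover the whole database quickly, then a second pass restricted to the families of interest, e.g. `--families PL1,PL2` with no streamlining, writing into the same local database.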

It's a poor workaround for speeding up the scraper, but it gets the GenBank accessions I need for pyrewton. Longer term, a better solution for increasing the rate of performing large scrapes of CAZy that doesn't have a potentially negative effect on data integrity might be worth looking into.

> Overall I suspect the web/network IO is the slow step

It certainly is when the CAZy server is having a slow day and it can take 10 or more attempts, each with the connection timing out at 45 seconds, to reach CAZy.
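
For reference, that retry behaviour amounts to something like this (a minimal sketch with the requests library; the 45 s timeout and 10 attempts mirror the numbers above, not the scraper's actual configuration):

```python
import time
import requests

# Minimal retry-with-timeout sketch; illustrative, not the scraper's code.
def fetch_with_retries(url, attempts=10, timeout=45, wait=10):
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.RequestException as err:
            print(f"Attempt {attempt}/{attempts} failed: {err}")
            time.sleep(wait)  # back off before retrying the connection
    return None  # caller must log the failure so the scrape isn't silently partial

# Hypothetical family page URL, for illustration only
html = fetch_with_retries("http://www.cazy.org/GH5_all.html")
```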

widdowquinn commented 3 years ago

> I haven't been able to find an instance in CAZy that breaks the rule. In theory there shouldn't be one, as the PDB, UniProt and GenBank accessions and EC numbers should all be associated with one another, or at least with a single protein record, which should then be presented in exactly the same way each time it appears in a CAZy protein table. But that relies on a lot of assumptions about the CAZy schema.

I think you'd be able to divine this from your scraped data, if you scraped a "raw" version of the website tables.

It's probably OK to assume that all the data is consistent, but as it's an assumption, it should be clearly stated (and ideally it would be checked, too…)
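
Something along these lines, run over the raw rows, would do as a check (a sketch; it assumes each scraped row has been kept as a (family, primary accession, associated data) record, which is not necessarily how the scraper stores things):

```python
from collections import defaultdict

# Sketch of the consistency check: group raw scraped rows by primary
# GenBank accession and flag any protein whose associated data differs
# between family tables. The row layout is assumed, not the real format.
def find_inconsistencies(rows):
    seen = defaultdict(set)  # primary accession -> distinct associated-data tuples
    for family, primary, associated in rows:
        seen[primary].add(tuple(sorted(associated)))
    return {acc for acc, variants in seen.items() if len(variants) > 1}

rows = [
    ("GH1", "WP_000001.1", ["A0A000", "1ABC"]),
    ("GH3", "WP_000001.1", ["A0A000"]),  # missing a PDB accession -> flagged
]
print(find_inconsistencies(rows))  # {'WP_000001.1'}
```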

With that in mind, I'm kind of annoyed with myself that I didn't suggest scraping the site in its entirety as it stands, as a local on-disk version, so that you don't have to wait for network connections while developing the database and other logic :(