Save daily RSS texts into a central database table
Determine how often each RSS source is checked for new entries. Note that RSS sources are often updated more than once per day, so each check must also make sure that data is not written to the database in duplicate.
Checking for new texts should be done a certain number of times per day, as defined in the configuration file.
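A minimal sketch of how the configured check frequency could be turned into a schedule. The INI layout, the section names, and the `checks_per_day` key are assumptions for illustration; the actual configuration file may look different.

```python
import configparser
from datetime import datetime, timedelta
from typing import Dict, Optional

# Hypothetical configuration: section names and the key name are assumptions.
EXAMPLE_CONFIG = """
[DEFAULT]
checks_per_day = 4

[example-news]
checks_per_day = 12

[example-blog]
"""


def load_checks_per_day(config_text: str) -> Dict[str, int]:
    """Read the number of daily checks per source, falling back to DEFAULT."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return {name: parser.getint(name, "checks_per_day") for name in parser.sections()}


def check_interval(checks_per_day: int) -> timedelta:
    """Spread the configured number of checks evenly over 24 hours."""
    if checks_per_day < 1:
        raise ValueError("checks_per_day must be at least 1")
    return timedelta(days=1) / checks_per_day


def is_check_due(last_checked: Optional[datetime], checks_per_day: int, now: datetime) -> bool:
    """A source is due if it was never checked or its interval has elapsed."""
    if last_checked is None:
        return True
    return now - last_checked >= check_interval(checks_per_day)
```

With this layout, a source without its own `checks_per_day` inherits the default, so only sources that update unusually often need an explicit entry.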
Handle duplicated data in the context of large datasets:
It would be nice to avoid writing out many rows in duplicate when only some rows of a source have changed -> this would save space.
Currently this is not possible: a feed gives us only one "updated" field per source, which applies to all of its rows.
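One possible way to avoid duplicate rows without a per-entry update timestamp is to let the database reject entries it has already stored. The sketch below is an assumption, not a decided design: it uses SQLite, a reduced hypothetical table `rss_entries`, and a hypothetical per-entry key (for example the entry's GUID or link) that is unique within a source.

```python
import sqlite3


def create_table(conn: sqlite3.Connection) -> None:
    # Hypothetical reduced schema; the UNIQUE constraint is what blocks duplicates.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS rss_entries (
            source_id TEXT NOT NULL,
            entry_key TEXT NOT NULL,
            title     TEXT,
            body      TEXT,
            UNIQUE (source_id, entry_key)
        )
        """
    )


def insert_new_entries(conn: sqlite3.Connection, source_id: str, entries: list) -> int:
    """Insert entries, silently skipping ones already stored.

    Returns the number of rows actually written.
    """
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO rss_entries (source_id, entry_key, title, body) "
        "VALUES (?, ?, ?, ?)",
        [(source_id, e["key"], e["title"], e["body"]) for e in entries],
    )
    conn.commit()
    return conn.total_changes - before
```

This only prevents re-inserting known entries. It does not detect an entry whose text changed under the same key, which would still require a per-entry update timestamp or a content comparison.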
Which timestamps do we need/have:
[x] publication date of an entry
[x] insert date of an entry into our database
[x] last update of a source
[ ] last update of an entry??
Currently not available; we may get this information from new sources later.
This may be resolved once we have discussed how and when to handle duplicated entries (rows).
[ ] last check of a source: this could be a field that is updated both when we
write new data from a feed and
do not write new data from the feed -> in this case we would only update the "last checked" timestamp on the rows we most recently wrote into the database for that source
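The timestamps listed above could map to columns roughly as follows. This is a sketch under assumptions: SQLite, a hypothetical table layout and column names, timestamps stored as ISO 8601 strings in one consistent format, and "the last rows we wrote" interpreted as the rows with the most recent insert date for that source.

```python
import sqlite3

# Hypothetical layout; all column names are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS rss_entries (
    id                INTEGER PRIMARY KEY,
    source_id         TEXT NOT NULL,
    title             TEXT,
    published_at      TEXT,           -- publication date of the entry
    inserted_at       TEXT NOT NULL,  -- insert date of the entry into our database
    source_updated_at TEXT,           -- last update of the source (one value per feed)
    last_checked_at   TEXT            -- last check of the source
)
"""


def touch_last_checked(conn: sqlite3.Connection, source_id: str, checked_at: str) -> int:
    """Set last_checked_at on the most recently inserted rows of a source.

    Intended for checks that found no new data. Returns the number of rows updated.
    """
    cur = conn.execute(
        """
        UPDATE rss_entries
        SET last_checked_at = ?
        WHERE source_id = ?
          AND inserted_at = (
              SELECT MAX(inserted_at) FROM rss_entries WHERE source_id = ?
          )
        """,
        (checked_at, source_id, source_id),
    )
    conn.commit()
    return cur.rowcount
```

When a check does write new rows, the same timestamp can simply be set as `last_checked_at` on those rows at insert time. A separate per-source table for the "last checked" value would avoid touching entry rows at all, but that is a different design from the one described above.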
@SarahWagner could you check the ideas above and update if needed? Let me know if you have questions or would like to discuss points!