Closed: leonawicz closed this issue 1 year ago
Hello, Matt. Thanks for reaching out.
I'm happy that STAPI has found another application. I can't offer any feedback at the code level, because R is a mystery to me, but I would really like to see what interesting statistical findings can be uncovered using your project, since I'm no good at statistical analysis myself.
Thank you for remembering to throttle your requests!
Actually, I think version 1 of the API is stable and there will be no changes to its interfaces. In the future, some services may gain a second version with some additional data inside, but those will be exposed at a different path, like /api/v2/rest/character/search. I will make the appropriate changes to the documentation shortly.
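For reference, a call against the stable v1 path looks something like the sketch below. This is not taken from the rtrek source, and the `name` query parameter is an assumption about the search endpoint:

```r
library(httr)
library(jsonlite)

resp <- GET(
  "http://stapi.co/api/v1/rest/character/search",
  query = list(name = "Picard")  # assumed query parameter
)
stop_for_status(resp)
result <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(result, max.level = 1)
```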
I figured probably no one working on STAPI would be familiar with R, but mostly I wanted to make sure I wasn't accidentally interacting with the API in some improper way or neglecting some other consideration. It sounds like everything is good!
I will update my docs regarding version 1 API stability. This makes me more comfortable about formally publishing the R package soon :)
Also, my library has the MIT license, so please feel free to add any datasets I curate for it directly to your database. That is, you don't have to dig them out of the library; I could just provide them in a universal format like csv if I end up producing something you're interested in.
The most interesting thing I have planned at the moment is a table of Star Trek novel summary data. It will include a number of metadata columns (title, author, publication date, etc.), but the more interesting columns will have everything from simple summary variables like chapter and word counts to results of text mining on the books: maybe sentiment analysis, or the relative occurrence of popular characters' names in each book. That kind of data could let people do anything from choosing what to read based on which books focus most on their favorite characters (or having a function recommend a filtered book list), to analyzing relationships between popular authors and where the focus of their Trek writing tends to cluster.
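As a concrete illustration of the character-occurrence idea, here is a minimal sketch; `book_text` and its columns are hypothetical stand-ins for the parsed book table:

```r
library(stringr)

# `book_text` is a hypothetical data frame: one row per book, with columns
# `title` and `text` (the full book text as a single string).
characters <- c("Kirk", "Spock", "Picard", "Janeway")

# Word-boundary-matched occurrence counts of each name in each book
name_counts <- sapply(characters, function(ch) {
  str_count(book_text$text, paste0("\\b", ch, "\\b"))
})
book_summary <- cbind(book_text["title"], name_counts)
```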
Anyway, in time there is the potential for a lot of good stuff!
I'll be sure to look at your data and see what could be integrated into STAPI (and I'll keep checking in the future as well). Although I'm not sure data derived exclusively from STAPI should be reintegrated into it. But data derived from other sources, or from multiple sources including STAPI: sure, why not. I hope that version 1 won't be the last!
The idea of analyzing the content of every ST book sounds really exciting, but obtaining every book in the right format is quite a challenge. BTW, do you know the website Star Trek Minutiae? They host the scripts for all the ST movies and series except Discovery. Maybe that's interesting material to analyze as well?
I don't have any worthwhile data available yet. I agree it does not make sense to include in STAPI anything derived from STAPI data.
Whoa! I was not aware of that site. Thanks for sharing that. I will gladly expand the scope of the text mining to include episodes!
Books
The Star Trek books won't be too much trouble because I've been reading the epubs for years, and other formats I can convert to epub as well. I have some R code I've been playing around with recently that reads and parses epub structure and text content directly. The big question is how similar things like book metadata will be across books. I haven't looked at them all (not even close), but so far things are okay.
Parsing the text is the easier part in many cases; it's more of a blunt instrument. All I need is a "bag of words," and it doesn't matter if it's a bit messy. But if I want to associate the results of my text analyses with each book's title and author, for example, then I need a generalizable method for obtaining that metadata without any "guessing" from the text, which would be impossible given the variability in book styles. Fortunately, so far it looks like it's easy enough to pull those fields directly out of a file's metadata without fussing with the text content. I'm hoping that as I look at more book files there won't be too much inconsistency.
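A minimal sketch of that metadata-extraction approach, reading Dublin Core fields straight from an epub's OPF package document (an epub is just a zip archive); this is illustrative, not the actual rtrek code:

```r
library(xml2)

# Pull Dublin Core metadata from the OPF package document inside an epub
epub_metadata <- function(file) {
  exdir <- tempfile()
  unzip(file, exdir = exdir)
  opf <- list.files(exdir, pattern = "\\.opf$", recursive = TRUE,
                    full.names = TRUE)[1]
  doc <- read_xml(opf)
  field <- function(name) {
    xml_text(xml_find_first(doc, paste0("//*[local-name()='", name, "']")))
  }
  list(title = field("title"), creator = field("creator"), date = field("date"))
}
```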
I don't have all of the very newest books but that's okay. And I will definitely be purchasing them when I'm ready to read them.
Timeline applications
One other thing I am highly interested in is a complete timeline data table of some kind that includes the chronological stardate entries for all the episodes and movies, but also the books. You can see this at the bottom of the page for Memory Beta entries. There is also a published timeline like this in the "Voyages of the Imagination" fiction companion by Jeff Ayers. But I don't know how to scrape or otherwise obtain such timeline data from Memory Beta. It would be amazing to have historical events, episodes, movies, and books associated with their respective stardates in a two-column table to start, eventually expanded to include a column indicating alternate branching timelines. It would be cool to make websites/apps that ingest timeline data and map it out visually and interactively for people. But I have no idea how to pull that kind of data out of Memory Beta.
Timelines are great, but I'm afraid Memory Beta does not have that much data systematized. When I was writing code to parse Memory Alpha content, sidebar templates were the primary source of well-structured data; page categories were a secondary source. Some of the novels on Memory Beta have stardates, dates, or both, but not all of them. If I were to pull that kind of data from Memory Beta, I would go over every infobox related to novels and look for specific fields in it. The MediaWiki public API, although not perfect, can be used to do that.
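For illustration, a category listing via the MediaWiki API might look like the sketch below; the endpoint and category name are assumptions, not verified against Memory Beta:

```r
library(httr)
library(jsonlite)

resp <- GET(
  "https://memory-beta.fandom.com/api.php",
  query = list(
    action  = "query",
    list    = "categorymembers",
    cmtitle = "Category:Novels",  # assumed category name
    cmlimit = 50,
    format  = "json"
  )
)
pages <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
pages$query$categorymembers$title
```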
Actually, when I was starting with STAPI, I imagined someone at some point would construct an interactive timeline on which different events from the universe could be placed. But to be honest, at least in STAPI, there is much more data about real-world events (episode first-airing dates, video release dates, birth and death dates, and so on) than about fictional events.
I wonder, then, if the Jeff Ayers book actually contains the most comprehensive timeline. It situates all the books among all the episodes chronologically, even though some entries are listed as "occurring between" two other entries rather than having a specific stardate. But for a visuals-based application that gives an overview, I could use interpolation to place such entries on a timeline, and that would be good enough.
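A toy sketch of that interpolation idea, using base R's approx() to place entries that lack an exact stardate between their dated neighbors (stardate values invented for the example):

```r
# Toy timeline: two entries lack an exact stardate
timeline <- data.frame(
  entry    = c("Episode A", "Novel X", "Novel Y", "Episode B"),
  stardate = c(41153.7, NA, NA, 41986.0)
)
idx   <- seq_len(nrow(timeline))
known <- !is.na(timeline$stardate)

# Linear interpolation between the dated neighbors
timeline$stardate_est <- approx(idx[known], timeline$stardate[known],
                                xout = idx)$y
timeline
```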
As for the books, I've been able to generalize my code enough to work on the 400 books I've tested so far, out of about 750 total. Most of them were actually pretty easy and didn't take much work, but wow, did some of the edge cases take a frustrating amount of regex! Once I find time to get through the rest, I can start compiling some interesting data and analysis results. Data prep is always 90% of the job. But it's more than halfway there already!
That sounds really interesting. I'm looking forward to learning about the final results!
Update: I've successfully parsed all ~750 books! The hard part is done. Next I'll be thinking about interesting things to do with the data. But I already have a decent metadata table with a number of columns (title, author, publisher, date, number of chapters, words, characters, etc.) and one row per book. This covers all books (that I know of) through late 2017; I haven't picked up any of the newer ones yet. The dataset can be updated from time to time as I get more books.
It doesn't look pretty in the gist. Just download the gist zip to get the file.
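Reading the file in is then a one-liner; the filename below is hypothetical:

```r
# Filename is hypothetical; see the gist for the actual file
books <- read.csv("star_trek_books.csv", stringsAsFactors = FALSE)
str(books)  # title, author, publisher, date, chapters, words, characters, ...
```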
The table could use a little more cleaning up; it's not perfect, so I may continue to tweak it a bit, but not much. Considering how messy the source ebooks' formatting was (poor or inconsistent naming conventions, for example), I'd consider this table good enough to call complete. I used a lot of regex to make things much more consistent than in the original files.
As time permits, I will continue with analysis of the much larger table of book texts, from which I will derive more interesting variables.
The data looks good. Actually, I think the number of chapters, words, and characters is something that could easily go into STAPI. Not sure about the dedication, though; I think that's too much of a raw material for this kind of API.
It's cool that you have extracted the original publication date and not the ebook publication date; I think the former is more relevant.
I need some way of matching your data set with the entries in STAPI, so I have a question about the ISBNs: are those the ISBNs of the ebooks, of the paper books, or mixed?
I'm not saying I will jump right into integrating your data, because I have some other work, but it will eventually be included in the database model, and later in the API model.
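A sketch of what matching on ISBN could look like on the R side, assuming both tables can be normalized to bare digit strings first (`stapi_books` and its columns are hypothetical):

```r
# Strip everything but digits (and X check digits) before joining
normalize_isbn <- function(x) gsub("[^0-9Xx]", "", x)

stapi_books$isbn <- normalize_isbn(stapi_books$isbn)  # hypothetical table/column
books$isbn       <- normalize_isbn(books$isbn)

matched <- merge(stapi_books, books, by = "isbn")
```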
I agree, you should drop the `dedication` field. I also had a "historian's note" field, since a number of books include those, but it was not parsing correctly. Those things are not useful for STAPI anyway. Also, you should drop the `file` column; I only included it in what I shared because in some confusing/ambiguous cases the filename might provide differentiating information you could use before removing it. Only keep what is relevant to the API.
There are a small number of potential duplicate entries I have not cleaned up yet. For example, the Strange New Worlds series has a book that appears several times under the same ISBN but with different character counts, meaning something is technically different about those files, though I don't know what yet. In other cases there is a true duplicate: two rows list the same ISBN and the same character count, so it's safe to assume they really are the same book, but one might show a slightly different title, or one entry might list multiple authors where the other lists only one.
I cleaned up some of these cases by removing obvious duplicate raw files that had slightly different names and for which it was clear one copy had superior metadata. But I know there are some remaining cases I have not handled yet. If you eventually have a programmatic way of including whichever pieces of this data you want in STAPI, it should be relatively safe to assume that any cleaner, more complete csv I provide later will have the exact same format; only some rows will be cleaned up and/or new rows added.
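For illustration, the suspect cases described above could be flagged along these lines (the column names `isbn` and `characters` are assumptions about the csv layout):

```r
library(dplyr)

dupes <- books %>%
  group_by(isbn) %>%
  filter(n() > 1) %>%
  # identical character counts suggest a true duplicate; differing counts
  # mean something in the files actually differs
  mutate(true_duplicate = n_distinct(characters) == 1) %>%
  ungroup() %>%
  arrange(isbn)
```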
I don't know if all the publication dates I was able to parse are original first-publication dates. The very old ones may be; I'm not certain about the newer books, but I'm hoping they are all originals. I also don't know if the ISBNs are specific to the ebooks. I would think they are, especially since some near-duplicate files had different ISBNs.
Cross-checking these table entries with human eyes would be a good thing if we could find more volunteers, maybe people who are not programmers but don't mind looking these things up. You and I definitely do not have time for all that.
I have plenty of other stuff to do too for now... :) Matt
I have an update! :) My `rtrek` library now includes more data, and previous data has been updated and cleaned up a lot. I've also added functionality that imports the Memory Beta chronology/timeline data: the complete timeline or subsets of it. If any of this data sounds interesting to you for STAPI, please let me know.
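Illustrative usage only; see the rtrek documentation for the authoritative function names and arguments:

```r
library(rtrek)

tl_year <- mb_timeline(2373)        # one in-universe year
tl_full <- mb_timeline("complete")  # entire timeline; many requests, slow
```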
If you're interested in details, you can find out more about the available data sets, as well as the Memory Alpha/Beta API functions, here. The 'Articles' menu now also has some examples of the type of content these functions return.
The Memory Alpha- and Memory Beta-related functions scrape HTML from some pages to return basic content. For my Memory Beta timeline function, which may read a large number of pages, I have enforced a 1-second minimum gap between individual requests (as I do with my STAPI wrapper). This makes it much slower, but it's more polite. I'm curious whether you have any suggestions about this, since you mentioned also pulling data from Memory Alpha. All this new code is on GitHub now (dev), but I have not formally published the new version outside GitHub yet. I will do that once I'm comfortable everything is stable.
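A minimal sketch of the politeness delay described here (not the actual rtrek internals):

```r
# Ensure at least `delay` seconds elapse between consecutive page requests
polite_get <- function(url, delay = 1) {
  t0 <- Sys.time()
  resp <- httr::GET(url)
  elapsed <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
  if (elapsed < delay) Sys.sleep(delay - elapsed)
  resp
}
```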
But since Memory Alpha and Memory Beta do not offer APIs, and I have to rely on web page scraping to import any data, I have no idea how often they may restructure their site pages. In fact, this happened recently while I was in the middle of writing my code, and I had to do a small rewrite to fix it. I don't know how frequently this may occur and break functionality. Hopefully not every week!
It doesn't seem that they change the structure a lot, at least not the structure of the structured data. Many articles are old enough and big enough that the way their paragraphs are structured doesn't change much either. It can happen, of course, but I don't remember having much trouble with it during development.
Anyway, I'm happy you are progressing fast. Unfortunately, I can't spare resources to develop new features in STAPI right now. A priority for me would be to dockerize it and publish a dockerized version, and that won't happen in the immediate future.
Closing. This is a very old conversation.
Hi,
I am developing an R package (library) called rtrek for the R programming language, which is popular among data scientists. The website accompanying the GitHub repo can be found here. One component of the package is a function that lets users make API calls to STAPI. Please feel free to take a look and let me know what you think.
I am really excited to see this API. It offers a ton of potential for other projects built around it. I would love to contribute, but my background is in statistical programming, and the best contribution I could continue to make is to further develop domain-specific analytics tools that make productive use of STAPI data.
I read through your documentation and appreciate that current development is in an alpha state and that there could be breaking changes. I also understand your team has limited resources for developing and running the API. I want to let you know that my package has anti-DDoS measures built in, limiting users to no more than one request per second, per a suggestion I read somewhere in your docs. In any case, I would not expect a huge amount of traffic via this client overall.
Regards, Matt