StampyAI / alignment-research-dataset

Stampy's copy of Alignment Research Dataset scraper
https://huggingface.co/datasets/StampyAI/alignment-research-dataset
MIT License
9 stars 7 forks

Arbital refactor #174

Closed Thomas-Lemoine closed 1 year ago

Thomas-Lemoine commented 1 year ago

Modified a few functions, especially markdownify_text, so that summaries are saved as summaries rather than as part of the text.

Thomas-Lemoine commented 1 year ago

I fetched a bunch of Arbital articles to check that it works correctly, and it seems fine, but occasionally there is stuff like:

The reasoning for an instrumental convergence claim says that for many utility functions $U_k$ and situations $S_i$ a $U_k$-consequentialist in situation $S_i$ will probably find some best policy $\pi_k = \underset{\pi_i \in \Pi}{\operatorname{argmax}}  \mathbb E [U_k | S_i, \pi_i ](https://arbital.com/p/)$ that happens to be inside the partition $X$.  If instead in situation $S_k$...

where a stray (https://arbital.com/p/) comes out of nowhere.

Thomas-Lemoine commented 1 year ago

My guess is that this is also a problem in the current dataset, though, since I don't see how my new code would have changed that behaviour. Basically, formulas that use brackets confuse the parser.
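
To make the failure mode concrete, here is a minimal sketch. It is not the project's actual parser: `naive_links`, `math_aware_links`, and both regexes are hypothetical stand-ins, and the sketch assumes inline math is always delimited by single `$` signs, as in the quoted example. It shows how treating every bracketed span as a link reproduces the stray `(https://arbital.com/p/)`, and how skipping `$...$` spans would avoid it.

```python
import re

ARBITAL_BASE = "https://arbital.com/p/"  # base URL seen in the output above

# Hypothetical, simplified stand-in for the real link parser: it treats every
# [...] span as a link, whatever the brackets contain.
BRACKET = re.compile(r"\[([^\[\]]*)\]")
# Assumption: inline math is delimited by single dollar signs.
MATH = re.compile(r"\$[^$]*\$")


def naive_links(text: str) -> str:
    """Rewrite every bracketed span as a markdown link (illustrative only)."""
    # A function is used as the replacement so backslashes in the math
    # (e.g. \pi_i) are not interpreted as regex escapes.
    return BRACKET.sub(lambda m: f"[{m.group(1)}]({ARBITAL_BASE})", text)


def math_aware_links(text: str) -> str:
    """Apply naive_links only to the text outside $...$ spans."""
    out, pos = [], 0
    for m in MATH.finditer(text):
        out.append(naive_links(text[pos:m.start()]))
        out.append(m.group(0))  # leave the math span untouched
        pos = m.end()
    out.append(naive_links(text[pos:]))
    return "".join(out)
```

A text that uses `$` as a currency sign would break the math detection, so this would need a sturdier delimiter check before being used on real articles.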

mruwnik commented 1 year ago

That's mathjax code. It should be fine. Might even be worth adding mathjax to the chatbot and seeing if it can generate pretty equations?

Thomas-Lemoine commented 1 year ago

> That's mathjax code. It should be fine. Might even be worth adding mathjax to the chatbot and seeing if it can generate pretty equations?

You might be misunderstanding what I meant when I quoted that result. Somewhere in that answer, which I think corresponds to https://arbital.com/p/instrumental_convergence/, there's some mathjax code, but our parser sees the brackets and treats them like a link ([123 ]). Since the content of the brackets is not in that format, the parser messes up. It tries to create [title](link), I think? I'm confused.

Thomas-Lemoine commented 1 year ago

One thing I'm considering is that if `parse_arbital_link` tries to create a link that is empty (i.e. `(https://arbital.com/p/)`), we just ignore that, close the bracket as though it were a real bracket, and go on our merry way.

Except, it seems like there are two cases that I don't know how we could distinguish: either there's math in brackets, in which case we want to keep the brackets as is and show their contents, or there are unfinished links, i.e. stuff like [pseudoconsequentialist], where someone links to a page not yet created in the hopes that it gets created in the future and can be automatically replaced by a real link. In those cases, we probably want to remove the brackets.
Does that make sense?

Thomas-Lemoine commented 1 year ago

`For some more flavorful examples of this method of using Bayes' rule, see [https://www.gwern.net/docs/statistics/1994-falk The ups and downs of the hope function in a fruitless search].`

becomes

`For some more flavorful examples of this method of using Bayes' rule, see [The ups and downs of the hope function in a fruitless search](https://arbital.com/p/https://www.gwern.net/docs/statistics/1994-falk).`

Thomas-Lemoine commented 1 year ago

Handled an additional edge case for `parse_arbital_link`.

Thomas-Lemoine commented 1 year ago

I'll make another PR that changes the entries given to `make_data_entry`; I guess it'll be a `summaries`: List[str] then? It feels like it used to be that though, so do you remember why you switched it back?
I suppose each entry being given a "summary" seems a bit more intuitive than giving it a list of many summaries, but if we allow an article to have many summaries, we might as well let them all be created from the same entry.

mruwnik commented 1 year ago

Most things that have summaries (e.g. arxiv or the alignment newsletter) only have a single summary, so it was easier to do it that way. Which of course breaks here :D

Thomas-Lemoine commented 1 year ago

That makes sense. I suppose I can just replace the "summary" key with "summaries" and the string with [string], but that makes it a bit less obvious to use. Alternatively, the "summary" key might take either List[str] | str, and then it checks whether it's an instance of str or of list and decides accordingly. That's maybe the most flexible, but maybe also more confusing or error-prone? Not sure.
Or, two keys, one "summary" and one "summaries": "summary" holds a str and "summaries" holds a List[str]; we assume only one is given; and in `make_data_entry`, the "summary" string is turned into a List[str], `[string]`, and appended to "summaries" or whatever. That's more flexible, but it has two different keys depending on how you want to pass in the summary, either as a list of strings or as a single string.

mruwnik commented 1 year ago

I'd go with checking for both keys, and even going so far as to join them if both are provided

Thomas-Lemoine commented 1 year ago

yeah, makes sense. Will add that shortly
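
For reference, the "check both keys and join them" approach agreed on above could look roughly like the sketch below. `merge_summaries` is a hypothetical helper name, not a function from the repository, and the ordering (the single "summary" appended after the "summaries" list) and the skipping of empty strings are assumptions rather than settled behaviour.

```python
from typing import Any, Dict, List


def merge_summaries(entry: Dict[str, Any]) -> List[str]:
    """Collect summaries from "summaries" (list of str) and/or "summary" (str)."""
    many = entry.get("summaries") or []
    if isinstance(many, str):
        # Tolerate a bare string passed under the plural key.
        many = [many]
    merged: List[str] = [s for s in many if isinstance(s, str) and s.strip()]

    single = entry.get("summary")
    if isinstance(single, str) and single.strip():
        # The single summary is treated as [single] and appended to the list.
        merged.append(single)
    return merged
```

With this shape, sources such as arxiv can keep passing a single "summary" string while Arbital passes a "summaries" list, and an entry that provides both ends up with one combined list.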