Open mit-mit opened 4 years ago
I think the package:stagehand readme's "Usage" section contains content similar to a typical package's Example tab: https://pub.dev/packages/stagehand#usage
Similarly, they have an Installing section that is a subset of our Installing tab.
Maybe we should do a top-level analysis of the readme, and if a section has a recognized title (e.g. installing, setup, usage, example use), then we can use that section and not require the separate file?
Hey, I would like to work on this. Could you point me in a direction from where I could get started?
@ramyak-mehra: we are using package:markdown to parse the .md file content. It has an AST, and it would be nice to process that AST to extract the hierarchical section structure of a document. From there we could do not only a table of contents but also this Usage extraction.
@isoos So far I have come up with something like this (not refined)
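import 'package:markdown/markdown.dart';

// Lowercase titles that mark a section worth extracting.
const _recognizedTitles = ['installing', 'setup', 'usage', 'example'];

void main() {
  final document = Document();
  final markdown = ''; // The README content goes here.
  final lines = markdown.replaceAll('\r\n', '\n').split('\n');
  final htmlLines = HtmlRenderer().render(document.parseLines(lines)).split('\n');
  print(_extract(htmlLines));
}

// True if any rendered line mentions a recognized title. A `return` inside
// a `forEach` callback only exits the callback, so `any` is used instead.
bool _extract(List<String> htmlLines) => htmlLines.any(_checkIfTitle);

// Case-insensitive check, since headings are usually capitalized.
bool _checkIfTitle(String content) {
  final lower = content.toLowerCase();
  return _recognizedTitles.any((title) => lower.contains(title));
}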
We can use this here. We should also check whether the title is actually a heading, probably using a regex.
@ramyak-mehra: Code like this may work for a lot of text content, but in general we try to recognize the structure from the parsed syntax tree. One example of such processing is the current changelog updater code: https://github.com/dart-lang/pub-dev/blob/master/app/lib/shared/markdown.dart#L322-L358
We would like to see generic processing similar to that, which would extract the hierarchical structure of the markdown (in typed classes) and then decide the content extraction based on that structure.
@isoos If I am understanding it correctly, we should have some kind of iterable or list, in the hierarchical order of the markdown, whose elements are typed classes (such as different classes for heading, paragraph, etc.), and from that we can make the decision?
@ramyak-mehra: I'm thinking more in a tree, like:
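class Section {
  final int level;
  final markdown.Node titleNode;
  final List<markdown.Node> contentNodes;
  List<Section> children;

  Section({
    required this.level,
    required this.titleNode,
    required this.contentNodes,
    List<Section>? children,
  }) : children = children ?? [];
}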
Maybe further methods to extract the text content of titleNode and also to format contentNodes + optionally children to HTML.
@isoos I was doing something like this github gist. It is probably not the best approach. I also found this node visitor, but I was not sure if it's the right way to go; I explored it a bit but was unable to fully understand it.
@ramyak-mehra: at a quick look, I think this code is at a very early stage and possibly won't handle a use case like this:
## section-2
Content of section-2.
#### section-4
Content of section-4.
Which should result in the structure of:
Section(level: 2, titleNode: <... /*section-2*/ ...>, contentNodes: <...>, children: [
Section(level: 4, titleNode: <... /*section-4*/ ...>, contentNodes: <...>),
]);
As you can see, the level is not the level of the tree node, but rather the level of the section title (e.g. h2 in HTML will be level: 2). Also, the sections should contain their logical content embedded...
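The expected structure above can be produced with a stack of currently open sections. The following is only a minimal sketch under assumptions: `Block` is a hypothetical stand-in for a parsed node (with package:markdown the heading level would presumably be derived from the heading element's tag, h1 to h6), and `Section` here stores plain strings rather than AST nodes. Content found before the first heading is simply skipped in this sketch.

```dart
/// Hypothetical stand-in for a parsed block (not a package:markdown type).
class Block {
  final int? headingLevel; // 1..6 for headings, null for other content
  final String text;
  const Block.heading(int level, this.text) : headingLevel = level;
  const Block.content(this.text) : headingLevel = null;
}

/// Simplified section: strings instead of AST nodes, for illustration only.
class Section {
  final int level;
  final String title;
  final List<String> content = [];
  final List<Section> children = [];
  Section(this.level, this.title);
}

/// Builds the section tree using a stack of currently open sections.
List<Section> buildSections(List<Block> blocks) {
  final roots = <Section>[];
  final open = <Section>[];
  for (final block in blocks) {
    final level = block.headingLevel;
    if (level == null) {
      // Content belongs to the innermost open section; content that
      // precedes the first heading is skipped in this sketch.
      if (open.isNotEmpty) open.last.content.add(block.text);
      continue;
    }
    // Close every open section at the same or a deeper heading level.
    while (open.isNotEmpty && open.last.level >= level) {
      open.removeLast();
    }
    final section = Section(level, block.text);
    (open.isEmpty ? roots : open.last.children).add(section);
    open.add(section);
  }
  return roots;
}

void main() {
  final roots = buildSections(const [
    Block.heading(2, 'section-2'),
    Block.content('Content of section-2.'),
    Block.heading(4, 'section-4'),
    Block.content('Content of section-4.'),
  ]);
  final child = roots.single.children.single;
  print('${roots.single.level} > ${child.level}: ${child.content}');
}
```

With this rule a heading becomes a child of the nearest preceding heading with a smaller level, so a level-4 title directly under a level-2 title nests without needing an intermediate level-3 section.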
It was just a starting point for me to move forward. I have one question: for an h1 section, are only the h2s its children, or are h2, h3, h4 ... h6 all children?
@isoos I wrote this script to build sections from parsed markdown: gist. It breaks when content is found before any heading. How should that case be handled? Also, what would be the next steps? Analyze the titleNodes for specific keywords? What would the keywords be?
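One possible way to approach both questions, offered only as a sketch: treat everything before the first heading as a title-less preamble, and match section titles against a keyword set after normalizing them. `Block`, `preambleOf`, and `isRecognizedTitle` are hypothetical names, not part of package:markdown or pub-dev, and the keyword set is just the titles already suggested in this thread.

```dart
/// Hypothetical stand-in for a parsed block (not a package:markdown type).
class Block {
  final int? headingLevel; // 1..6 for headings, null for other content
  final String text;
  const Block.heading(int level, this.text) : headingLevel = level;
  const Block.content(this.text) : headingLevel = null;
}

// Candidate keywords, taken from the titles suggested in this thread.
const recognizedTitles = {
  'installing',
  'setup',
  'usage',
  'example',
  'example use',
};

/// Everything before the first heading, kept as a title-less preamble.
List<Block> preambleOf(List<Block> blocks) =>
    blocks.takeWhile((b) => b.headingLevel == null).toList();

/// Lowercases the title and drops punctuation before the lookup.
bool isRecognizedTitle(String title) {
  final normalized = title
      .toLowerCase()
      .replaceAll(RegExp(r'[^a-z0-9 ]'), ' ')
      .replaceAll(RegExp(r'\s+'), ' ')
      .trim();
  return recognizedTitles.contains(normalized);
}

void main() {
  const blocks = [
    Block.content('A badge line before any heading.'),
    Block.heading(2, 'Usage:'),
    Block.content('Run the tool.'),
  ];
  print(preambleOf(blocks).length);
  for (final b in blocks.where((b) => b.headingLevel != null)) {
    print('${b.text}: ${isRecognizedTitle(b.text)}');
  }
}
```

Exact matching on the normalized title is stricter than a substring check, so a heading such as "Unusual usage notes" would not be picked up; whether that trade-off is right depends on how many real readmes use longer titles.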
Some of our analysis guidelines make no sense for packages that are tools (e.g. stagehand). For example, here's an issue about a tool losing points for not having an example file: https://github.com/dart-lang/stagehand/issues/638#issue-505545391