dart-lang / pub-dev

The pub.dev website
https://pub.dev

Custom analysis for packages which are tools #3657

Open mit-mit opened 4 years ago

mit-mit commented 4 years ago

Some of our analysis guidelines make no sense for packages which are tools (e.g. stagehand). For example, here's an issue about a tool losing points for not having an example file: https://github.com/dart-lang/stagehand/issues/638#issue-505545391

isoos commented 4 years ago

I think package:stagehand readme's "Usage" section does contain content that is similar to stereotypical package's example tab: https://pub.dev/packages/stagehand#usage Similarly, they have an Installing section that is a subset of our Installing tab.

Maybe we should do a top-level analysis of the readme, and if a section has a recognized title (e.g. installing, setup, usage, example use), then we could use that section and not require the separate file?
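To make the "recognized title" idea concrete, here is a minimal sketch of the matching step only, assuming the heading text has already been extracted from the readme. The names `recognizedTitles`, `normalizeTitle` and `isRecognizedTitle` are hypothetical, and the title list is just the set of examples mentioned above, not an agreed-upon list.

```dart
// Hypothetical list: these are only the example titles from the comment above.
const recognizedTitles = {'installing', 'setup', 'usage', 'example use'};

/// Normalizes a heading: lower-case, punctuation replaced by spaces,
/// runs of whitespace collapsed, leading/trailing whitespace trimmed.
String normalizeTitle(String title) => title
    .toLowerCase()
    .replaceAll(RegExp(r'[^a-z0-9 ]'), ' ')
    .replaceAll(RegExp(r'\s+'), ' ')
    .trim();

/// True when the whole normalized heading equals a recognized title.
/// Exact matching avoids false hits on longer headings that merely
/// contain one of the keywords.
bool isRecognizedTitle(String title) =>
    recognizedTitles.contains(normalizeTitle(title));
```

With this sketch, a heading such as "Usage" or "Example use:" would be accepted, while "Changelog" or "Unusual usage patterns" would not. Whether exact or substring matching is the better policy is an open design choice.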

ramyak-mehra commented 3 years ago

Hey, I would like to work on this. Could you point me in a direction from where I could get started?

isoos commented 3 years ago

@ramyak-mehra: we are using package:markdown to parse the .md file content. It has an AST, and it would be nice to process that AST to extract the hierarchical section structure of a document. From there we could build not only a table of contents but also this Usage extraction.

ramyak-mehra commented 3 years ago

@isoos So far I have come up with something like this (not refined)

import 'package:markdown/markdown.dart';

const _recognizedTitles = ['installing', 'setup', 'usage', 'example'];

void main() {
  final document = Document();
  final markdown = ''; // README content goes here
  final lines = markdown.replaceAll('\r\n', '\n').split('\n');
  final htmlLines = HtmlRenderer().render(document.parseLines(lines)).split('\n');
  print(_extract(htmlLines));
}

// A `return true` inside a forEach closure only exits the closure, so the
// forEach version always returned false; `any` stops at the first match.
bool _extract(List<String> htmlLines) => htmlLines.any(_checkIfTitle);

// Lower-cased so that a heading such as "Usage" matches 'usage'.
bool _checkIfTitle(String content) {
  final lower = content.toLowerCase();
  return _recognizedTitles.any((title) => lower.contains(title));
}

We can use this here. We should probably also check whether the matched title is actually a heading, possibly using a regex.

isoos commented 3 years ago

@ramyak-mehra: Code like this may work for a large share of text content, but in general we try to recognise the structure from the parsed syntax tree. One example of such processing is the current changelog updater code: https://github.com/dart-lang/pub-dev/blob/master/app/lib/shared/markdown.dart#L322-L358

We would like to see generic processing similar to that, which would extract the hierarchical structure of the markdown (in typed classes), and then decide the content extraction based on that structure.

ramyak-mehra commented 3 years ago

@isoos If I understand correctly, we should have some kind of iterable or list that follows the hierarchical order of the markdown, with elements in typed classes (separate classes for heading, paragraph, etc.), and make the decision from that?

isoos commented 3 years ago

@ramyak-mehra: I'm thinking more of a tree, like:

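import 'package:markdown/markdown.dart' as markdown;

class Section {
  final int level; // level of the section title (2 for h2)
  final markdown.Node titleNode;
  final List<markdown.Node> contentNodes;
  List<Section> children;

  Section({required this.level, required this.titleNode,
      required this.contentNodes, List<Section>? children})
      : children = children ?? [];
}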

Maybe also add methods to extract the text content of titleNode and to format contentNodes (plus, optionally, children) to HTML.

ramyak-mehra commented 3 years ago

@isoos I was doing something like this GitHub gist. It is probably not the best approach. I also found this node visitor, but I was not sure if it's the right way to go; I explored it a bit but was unable to fully understand it.

isoos commented 3 years ago

@ramyak-mehra: at a quick look, I think this code is at a very early stage, and possibly won't handle a use case like this:

## section-2

Content of section-2.

#### section-4

Content of section-4.

Which should result in the structure of:

Section(level: 2, titleNode: <... /*section-2*/ ...>, contentNodes: <...>, children: [
  Section(level: 4, titleNode: <... /*section-4*/ ...>, contentNodes: <...>),
]);

As you can see, the level is not the depth of the tree node, but rather the level of the section title (e.g. an h2 in HTML will be level: 2). Also, the sections should contain their logical content embedded...
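The nesting rule described here can be sketched with a stack of open sections: each heading becomes a child of the nearest preceding heading with a strictly smaller level. This is only an illustration under simplifying assumptions. `Heading`, `SectionNode` and `buildSections` are hypothetical names, they work on plain titles instead of package:markdown nodes, and content nodes are left out.

```dart
/// Hypothetical stand-in for a parsed heading; not a package:markdown type.
class Heading {
  final int level; // 1..6, as in h1..h6
  final String title;
  const Heading(this.level, this.title);
}

/// Simplified section: keeps the heading level, not the depth in the tree.
class SectionNode {
  final int level;
  final String title;
  final List<SectionNode> children = [];
  SectionNode(this.level, this.title);
}

/// Nests headings by level and returns the top-level sections.
List<SectionNode> buildSections(List<Heading> headings) {
  final roots = <SectionNode>[];
  final stack = <SectionNode>[]; // currently open sections, outermost first
  for (final h in headings) {
    final node = SectionNode(h.level, h.title);
    // Close every open section at the same or a deeper level.
    while (stack.isNotEmpty && stack.last.level >= h.level) {
      stack.removeLast();
    }
    // Attach to the innermost open section, or to the top level if none.
    (stack.isEmpty ? roots : stack.last.children).add(node);
    stack.add(node);
  }
  return roots;
}
```

Under this rule the `## section-2` / `#### section-4` example yields one top-level section of level 2 with a single child of level 4, even though level 3 is skipped.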

ramyak-mehra commented 3 years ago

It was just a starting point for me to move forward. I have one doubt: for an h1 section, are only the h2 sections its children, or are h2, h3, h4 ... h6 all its children?

ramyak-mehra commented 3 years ago

@isoos I wrote this script to make sections from parsed markdown: gist. It breaks when content is found before any heading. How should that case be handled? Also, what would be the next steps? Analysing the titleNodes for specific keywords? What would the keywords be?
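One possible way to handle content that appears before any heading, offered only as a sketch and not as the approach used in the gist, is to split that leading content off as a preamble before building sections. `Block`, `SplitBlocks` and `splitPreamble` are hypothetical names, and `Block` is a simplified stand-in for a parsed node.

```dart
/// Hypothetical stand-in for a parsed block; not a package:markdown type.
class Block {
  final String text;
  final int? headingLevel; // null for non-heading content
  const Block(this.text, {this.headingLevel});
  bool get isHeading => headingLevel != null;
}

/// Result of the split: `preamble` holds the blocks before the first heading,
/// `rest` starts at the first heading (or is empty if there is none).
class SplitBlocks {
  final List<Block> preamble;
  final List<Block> rest;
  const SplitBlocks(this.preamble, this.rest);
}

SplitBlocks splitPreamble(List<Block> blocks) {
  final preamble = blocks.takeWhile((b) => !b.isHeading).toList();
  return SplitBlocks(preamble, blocks.skip(preamble.length).toList());
}
```

The preamble could then be kept as the content of an implicit root section, while `rest` always begins with a heading and can go through the normal section-building step.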