Closed jfbourke closed 1 year ago
Since you made PlainTextReader.DisposeAsync()
virtual, could you add to the method?
GC.SuppressFinalize(this);
public async ValueTask DisposeAsync()
{
GC.SuppressFinalize(this);
}
Updated PlainTextReader.DisposeAsync() to clr suppress finalizer.
@jfbourke Thanks for implementing this, I don't want to do this myself so that others still get their chance to contribute :smile:
Is there a reason you went with writing to a temporary file, which is then read by the PlainTextReader
?
Where you do
private static string ExtractContent(string zipPath)
{
using (ZipArchive archive = ZipFile.OpenRead(zipPath))
{
foreach (ZipArchiveEntry entry in archive.Entries)
{
if (entry.FullName.Equals("content.xml", StringComparison.OrdinalIgnoreCase))
{
using(var writer = new StreamWriter(DestinationPath, true, Encoding.UTF8, 4096))
{
using(var reader = new XmlTextReader(entry.Open()))
{
while(reader.Read())
{
if (reader.NodeType == XmlNodeType.Text ||
reader.NodeType == XmlNodeType.CDATA)
{
writer.WriteLine(reader.ReadContentAsString());
}
}
}
}
return DestinationPath;
}
}
}
throw new Exception($"Unable to load content.xml from '{zipPath}'");
}
Would it not also be doable to just implement ILineReader
and return reader.ReadContentAsString()
?
public async IAsyncEnumerable<string?> ReadLineAsync([EnumeratorCancellation] CancellationToken cancellation = default)
{
using (ZipArchive archive = ZipFile.OpenRead(this.ZipPath))
{
foreach (ZipArchiveEntry entry in archive.Entries)
{
if (entry.FullName.Equals("content.xml", StringComparison.OrdinalIgnoreCase))
{
using(var reader = new XmlTextReader(entry.Open()))
{
while(await reader.ReadAsync())
{
if (reader.NodeType == XmlNodeType.Text ||
reader.NodeType == XmlNodeType.CDATA)
{
yield return await reader.ReadContentAsStringAsync();
}
}
}
yield break;
}
}
}
throw new Exception($"Unable to load content.xml from '{this.ZipPath}'");
}
I don't see where the overhead of reading, writing, and reading again is worth it
@GStefanowich agree, will refactor it now.
The OpenDocumentTextReader will extract content.xml from the .ODT, parse the file and then write the text into a temporary file which is the passed into PlainTextReader for further parsing for email addresses.