allenday / nanostream-dataflow

real-time stream processing of DNA nanopore sequencer reads with dataflow
MIT License

avoiding CGI #99

Open lachlancoin opened 5 years ago

lachlancoin commented 5 years ago

I am sure you considered this, but why not use the FastQsplitter to split into batches -> write to bucket -> send location to AlignerCluster via pubsub -> AlignerCluster has a script which kicks off bwa alignment -> send location of BAM via pubsub -> dataflow proceeds?

I guess the reason is that this becomes asynchronous (i.e. one dataflow process has to be running to split the fastq, and then another to read the BAM). Is it possible to have these two asynchronous processes running on different threads within Dataflow? Or indeed to have two dataflow jobs running (one splitting, and one processing the BAM)?
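
For illustration, the proposed flow can be sketched as three decoupled stages that only exchange file locations. This is a minimal sketch, not the project's code: `queue.Queue` stands in for the two pubsub topics, the bucket path is hypothetical, and the aligner stage only renames the location where a real worker would run bwa.

```python
import queue
import threading


def splitter(batches, fastq_topic):
    """Stage 1: pretend to write each FASTQ batch to a bucket, then publish its location."""
    for i, _records in enumerate(batches):
        # Hypothetical bucket layout; a real splitter would upload the batch here.
        fastq_topic.put(f"gs://example-bucket/batches/batch_{i}.fastq")
    fastq_topic.put(None)  # sentinel: no more batches


def aligner(fastq_topic, bam_topic):
    """Stage 2: stand-in for the aligner cluster (this is where bwa would run)."""
    while (location := fastq_topic.get()) is not None:
        bam_topic.put(location.replace(".fastq", ".bam"))
    bam_topic.put(None)


def downstream(bam_topic, results):
    """Stage 3: the part of the pipeline that proceeds once a BAM location arrives."""
    while (location := bam_topic.get()) is not None:
        results.append(location)


def run_pipeline(batches):
    fastq_topic, bam_topic, results = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=splitter, args=(batches, fastq_topic)),
        threading.Thread(target=aligner, args=(fastq_topic, bam_topic)),
        threading.Thread(target=downstream, args=(bam_topic, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each stage only sees locations, the splitting and the BAM processing could equally be two separate jobs rather than threads in one process.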

allenday commented 5 years ago

It's possible to do as you describe, yes.

We implemented it as it is now in order to have a single pipeline that contains all of the application logic.

Other than avoiding CGI, is there an advantage to having two distinct dataflows in the proposed architecture?

On a related note, we are exploring having a cluster that communicates with dataflow (or anything else) via pubsub: GCS fastq in, GCS SAM out. This also enables e.g. variant calling using the same pattern. I began implementing a POC; I can give you what I have if you'd like to work on it.
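
As a rough sketch of the "GCS fastq in, GCS SAM out" pattern, a worker could turn each incoming message into an alignment command and a reply message. The message fields (`fastq_uri`, `output_prefix`), the local paths, and the reference location are assumptions made for this example, not part of the POC; the only external fact relied on is that `bwa mem <reference> <reads.fq>` writes SAM to standard output.

```python
import json
import posixpath


def handle_request(message, reference="/data/reference.fa"):
    """Map one request message to (alignment command, reply message).

    The caller is expected to download the FASTQ to the local path, run the
    command with stdout redirected to a SAM file, upload it, then publish the reply.
    """
    request = json.loads(message)
    fastq_uri = request["fastq_uri"]          # hypothetical field name
    stem = posixpath.splitext(posixpath.basename(fastq_uri))[0]
    local_fastq = f"/tmp/{stem}.fastq"        # hypothetical scratch location
    sam_uri = f"{request['output_prefix'].rstrip('/')}/{stem}.sam"
    command = ["bwa", "mem", reference, local_fastq]
    reply = json.dumps({"fastq_uri": fastq_uri, "sam_uri": sam_uri})
    return command, reply
```

Swapping the command for a variant caller would reuse the same message shape, which is the appeal of the pattern.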


lachlancoin commented 5 years ago

The advantage is just avoiding CGI, which as I understand it adds complexity (load balancers etc.) and is also the step where the pipeline gets blocked for us. The pattern you describe sounds great.


lachlancoin commented 5 years ago

Hi @allenday @obsh @Pseverin

I am wondering whether it's possible to have a cut-down dataflow pipeline which carves out the post-BAM processing and does not do any of the fastq processing.

On our end we are working on the minimap2 docker, which will independently process any fastq arriving in the UPLOAD_BUCKET (via a mount of the cloud bucket on the instance) and produce a BAM file, which could then go into the cut-down dataflow pipeline. I should point out that we can control how finely the fastq are split on the nanopore device, and we have a client-side script which watches for new fastq and syncs those to the UPLOAD_BUCKET. So it's not strictly necessary to split further on the GCP side.

Another advantage of this is that we could test the post-BAM processing independently of the fastq processing steps. At the moment we are getting stuck in the alignment step.
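
To make the client-side watching step concrete, here is a minimal polling sketch. It is not the script mentioned above: the function names are made up for the example, and `upload` is a placeholder callback for whatever actually copies a file to the UPLOAD_BUCKET.

```python
from pathlib import Path


def find_new_fastq(watch_dir, already_synced):
    """Return FASTQ files in watch_dir that have not been synced yet, sorted by name."""
    candidates = sorted(
        p for p in Path(watch_dir).iterdir()
        if p.is_file() and p.name.endswith((".fastq", ".fastq.gz"))
    )
    return [p for p in candidates if p.name not in already_synced]


def sync_once(watch_dir, already_synced, upload):
    """Upload every new FASTQ once and remember its name; returns the names handled."""
    new_files = find_new_fastq(watch_dir, already_synced)
    for path in new_files:
        upload(path)  # placeholder: copy the file to the bucket
        already_synced.add(path.name)
    return [p.name for p in new_files]
```

Calling `sync_once` in a loop with a short sleep gives a simple watcher. A real script would probably also need to avoid picking up a file that the sequencer is still writing.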

lachlancoin commented 5 years ago

There are a few more advantages to this setup.

  1. we can control upstream processing more easily, e.g. compression/decompression or encryption/decryption

  2. I think we could hack minimap2 to continue working on subsequent fastqs uploaded while it's still processing

obsh commented 5 years ago

I think there are two main options: we can add another class with a cut-down pipeline that subscribes to BAM/SAM file upload events, or we can extend the existing pipeline to detect the uploaded file type and bypass steps that are not needed, e.g. if a BAM/SAM file is uploaded, bypass the alignment step.
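
The second option amounts to routing on the uploaded object's name. A minimal sketch of that decision follows; the step names and suffix lists are illustrative assumptions, not identifiers from the pipeline.

```python
ALIGNED_SUFFIXES = (".bam", ".sam")
RAW_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz")


def steps_for(object_name):
    """Pick the processing steps for an uploaded file based on its suffix."""
    name = object_name.lower()
    if name.endswith(ALIGNED_SUFFIXES):
        # Already aligned: bypass the alignment step.
        return ["parse_alignments", "downstream_processing"]
    if name.endswith(RAW_SUFFIXES):
        return ["align", "parse_alignments", "downstream_processing"]
    raise ValueError(f"unsupported upload type: {object_name}")
```

For example, `steps_for("run1/batch_3.bam")` omits `"align"`, while `steps_for("run1/batch_3.fastq")` includes it.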

obsh commented 5 years ago

We haven't worked with BAM files previously. Am I correct that .bam files are always created with a corresponding index file, .bam.bai?

lachlancoin commented 5 years ago

We don't need the bai in this case, as we need to read the whole BAM.
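
To illustrate why no index is needed for a full pass: a BAM can be read front to back, since the index only serves random access by region. The sketch below counts records using only the Python standard library. It assumes the binary layout from the SAM/BAM specification (gzip-compatible BGZF container, `BAM\1` magic, header text, reference list, then length-prefixed records) and does no error handling for truncated files; in practice a library such as htsjdk or pysam would do this.

```python
import gzip
import struct


def count_bam_records(path):
    """Count alignment records by streaming the whole BAM, without a .bai index."""
    with gzip.open(path, "rb") as bam:
        if bam.read(4) != b"BAM\x01":
            raise ValueError("not a BAM file")
        (l_text,) = struct.unpack("<i", bam.read(4))
        bam.read(l_text)                   # skip the SAM header text
        (n_ref,) = struct.unpack("<i", bam.read(4))
        for _ in range(n_ref):
            (l_name,) = struct.unpack("<i", bam.read(4))
            bam.read(l_name + 4)           # reference name plus its int32 length
        count = 0
        while True:
            size_bytes = bam.read(4)
            if len(size_bytes) < 4:        # end of file
                break
            (block_size,) = struct.unpack("<i", size_bytes)
            bam.read(block_size)           # skip the record body
            count += 1
        return count
```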


obsh commented 5 years ago

I've created a separate branch, bam_files, with a pipeline version which skips the alignment steps for BAM/SAM files. But I believe there will be errors on the k-align step if your pipeline has difficulties connecting to the alignment cluster.

lachlancoin commented 5 years ago

Yes, I see, so we still need to provision an alignment cluster, but could probably do so with less memory.

Also, I don't believe the k-align step is necessary for the species typing. I am not sure it's currently required for AMR typing in your pipeline either: it is handy in the AMR pipeline for getting highly accurate base-level sequences at the end, but we are not exploiting that at the moment, we are just using the counts.
